Abstract

The ratio of two probability densities can be used for solving various machine learning
tasks such as covariate shift adaptation (importance sampling), outlier detection
(likelihood-ratio test), and feature selection (mutual information). Recently, several methods
of directly estimating the density ratio have been developed, e.g., kernel mean matching,
maximum likelihood density ratio estimation, and least-squares density ratio fitting. In this
paper, we consider a kernelized variant of the least-squares method and investigate its
theoretical properties from the viewpoint of the condition number using smoothed analysis
techniques; the condition number of the Hessian matrix determines the convergence rate of
optimization and the numerical stability. We show that the kernel least-squares method has a
smaller condition number than a version of kernel mean matching and other M-estimators,
implying that the kernel least-squares method has preferable numerical properties. We further
give an alternative formulation of the kernel least-squares estimator which is shown to possess
an even smaller condition number. Numerical experiments agree with our theoretical analysis.



1 Introduction



The problem of estimating the ratio of two probability densities is attracting a great deal of
attention these days, since the density ratio can be used for various purposes such as covariate
shift adaptation (Shimodaira, 2000; Zadrozny, 2004; Sugiyama & Müller, 2005; Huang et al.,
2007; Sugiyama et al., 2007; Bickel et al., 2009), outlier detection (Schölkopf et al., 2001; Tax
& Duin, 2004; Hodge & Austin, 2004; Hido et al., 2008), and divergence estimation (Nguyen
et al., 2008; Suzuki et al., 2008).

A naive approach to density ratio estimation is to first separately estimate two probability
densities and then take the ratio of the estimated densities. However, density estimation is
known to be a hard problem, particularly in high-dimensional cases, unless we have simple and
good parametric density models (Vapnik, 1998; Härdle et al., 2004), which may not be the case
in practice.

Recently, methods of directly estimating the density ratio without going through density
estimation have been developed. The kernel mean matching (KMM) method (Huang et al., 2007)
directly gives estimates of the density ratio by matching the two distributions efficiently using
a special property of universal reproducing kernel Hilbert spaces (RKHSs) (Steinwart, 2001).
Another approach is an M-estimator (Nguyen et al., 2008) based on non-asymptotic variational
characterization of the f-divergence (Ali & Silvey, 1966; Csiszár, 1967). See also Sugiyama
et al. (2008a) for a similar algorithm under the Kullback-Leibler divergence. Non-parametric
convergence properties of the M-estimator in RKHSs have been elucidated under the
Kullback-Leibler divergence (Nguyen et al., 2008; Sugiyama et al., 2008b). A squared-loss version
of the M-estimator for linear density-ratio models called unconstrained Least-Squares Importance
Fitting (uLSIF) has been developed and has been shown to possess useful computational properties,
e.g., a closed-form solution is available and the leave-one-out cross-validation score can be
computed analytically (Kanamori et al., 2009).

In this paper, we consider a kernelized variant of uLSIF (KuLSIF) and analyze its properties
in numerical optimization from the viewpoint of the condition number. The condition number of
the Hessian matrix of the objective function plays a crucial role (Luenberger & Ye, 2008;
Bertsekas, 1996), i.e., it determines the convergence rate of optimization and the numerical
stability. When an objective function to be optimized is randomly chosen and fed into an
optimization algorithm, the computational cost of the algorithm can be assessed by the
distribution of the condition number. The distribution of condition numbers of randomly
perturbed matrices has been studied under the name of smoothed analysis (Spielman & Teng, 2004;
Sankar et al., 2006). Smoothed analysis was originally introduced to explain the success of
algorithms and heuristics that could not be well understood through traditional worst-case and
average-case analysis; it gives a more realistic analysis of the practical performance of
algorithms.

We apply smoothed analysis techniques to derive the distribution of the condition number
of density-ratio estimation algorithms. More specifically, we first give a unified view of the
objective functions of KuLSIF and KMM. Then we show that KuLSIF has a smaller condition
number than an "induction" variant of KMM, implying that KuLSIF is preferable to
KMM in optimization. We further show that KuLSIF, which could be regarded as an instance
of M-estimators, has the smallest condition number among all M-estimators in the min-max
sense (i.e., the worst condition number over all density ratio functions is the smallest in KuLSIF).
We also give a probabilistic evaluation of the condition number of M-estimators and show that
KuLSIF is favorable. These theoretical findings are also verified through numerical experiments.
We further give an alternative formulation of KuLSIF, denoted as Reduced-KuLSIF,
and show that it possesses an even smaller condition number.

The rest of this paper is organized as follows. In Section 2, we formulate the problem
of density ratio estimation and briefly review existing methods. In Section 3, we describe
the KuLSIF algorithm, and show its fundamental properties such as the convergence rate and
availability of the analytic-form solution and the analytic-form leave-one-out cross-validation
score. In Section 4, we show the relation between KuLSIF and KMM. Section 5 is the main
contribution of this paper, giving condition number analysis of density ratio estimation
methods. In Section 6, we give an alternative formulation of KuLSIF by transforming loss
functions and show that it possesses an even smaller condition number. In Section 7, we
experimentally investigate the behavior of the condition numbers, confirming the validity of
our theories. In Section 8, we conclude by summarizing our contributions and showing
possible future directions.

2 Estimation of Density Ratio

We formulate the problem of density ratio estimation and briefly review existing methods. 



2.1 Formulation and Notations 



Consider two probability distributions $P$ and $Q$ on a probability space $\mathcal{Z}$. Assume
that both distributions have the probability densities $p$ and $q$, respectively. We assume
$p(x) > 0$ for all $x \in \mathcal{Z}$. Suppose that we are given two sets of independent and
identically distributed (i.i.d.) samples,

$$X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} P, \qquad Y_1, \ldots, Y_m \overset{\text{i.i.d.}}{\sim} Q. \tag{1}$$

Our goal is to estimate the density ratio

$$w_0(x) = \frac{q(x)}{p(x)} \quad (\geq 0)$$

based on the observed samples.

We summarize some notations to be used throughout the paper. For a vector $a$ in the
Euclidean space, $\|a\|$ denotes the Euclidean norm. Given a probability distribution $P$ and a
random variable $h(X)$, we denote the expectation of $h(X)$ under $P$ by $\int h\,dP$ or
$\int h(x)P(dx)$. Given samples $X_1, \ldots, X_n$ from $P$, the empirical distribution is denoted
by $P_n$. The expectation $\int h\,dP_n$ denotes the empirical mean of $h(X)$, that is,
$\frac{1}{n}\sum_{i=1}^{n} h(X_i)$. Let $\|\cdot\|_\infty$ be the infinity norm, and $\|\cdot\|_P$
be the $L_2$-norm under the probability $P$, i.e., $\|h\|_P^2 = \int |h|^2\,dP$. For a reproducing
kernel Hilbert space (RKHS) $\mathcal{H}$ (Schölkopf & Smola, 2002), the inner product and the
norm on $\mathcal{H}$ are denoted as $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ and
$\|\cdot\|_{\mathcal{H}}$, respectively.

Below we review several approaches to density ratio estimation. 

2.2 Kernel Mean Matching 

The kernel mean matching (KMM) method allows us to directly obtain an estimate of $w_0(x)$ at
$X_1, \ldots, X_n$ without going through density estimation (Huang et al., 2007).

The basic idea of KMM is to find $w_0(x)$ such that the mean discrepancy between non-linearly
transformed samples drawn from $P$ and $Q$ is minimized in a universal reproducing kernel Hilbert
space (Steinwart, 2001). We introduce the definition of a universal kernel below.

Definition 1 (Steinwart (2001)). A continuous kernel $k$ on a compact metric space $\mathcal{Z}$
is called universal if the RKHS $\mathcal{H}$ of $k$ is dense in the set of all continuous
functions on $\mathcal{Z}$, that is, for every continuous function $g$ on $\mathcal{Z}$ and all
$\varepsilon > 0$, there exists an $f \in \mathcal{H}$ such that $\|f - g\|_\infty < \varepsilon$.
The corresponding RKHS is called a universal RKHS.

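The Gaussian kernel is an example of universal kernels. Let $\mathcal{H}$ be a universal RKHS
endowed with the kernel function $k : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$. For any
$x \in \mathcal{Z}$, the function $k(\cdot, x)$ is regarded as an element of $\mathcal{H}$. Then,
it has been shown that the solution of the following optimization problem agrees with the true
density ratio $w_0$:

$$\min_{w} \; \frac{1}{2} \left\| \int w(x) k(\cdot, x) P(dx) - \int k(\cdot, y) Q(dy) \right\|_{\mathcal{H}}^2 \quad \text{s.t.} \quad \int w\,dP = 1 \ \text{ and } \ w \geq 0.$$
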
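Indeed, when $w = w_0$, the loss function equals zero. Replacing $P$ and $Q$ with the empirical
distributions and dropping the term that does not depend on $w$, an empirical version of the
above problem is reduced to the following convex quadratic program:

$$\min_{w_1, \ldots, w_n} \; \frac{1}{2n^2} \sum_{i,j=1}^{n} w_i w_j k(X_i, X_j) - \frac{1}{nm} \sum_{j=1}^{m} \sum_{i=1}^{n} w_i k(X_i, Y_j)$$
$$\text{s.t.} \quad \left| \frac{1}{n} \sum_{i=1}^{n} w_i - 1 \right| \leq \varepsilon \ \text{ and } \ 0 \leq w_1, w_2, \ldots, w_n \leq B. \tag{2}$$
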



Tuning parameters, $B \geq 0$ and $\varepsilon \geq 0$, control the regularization effects. The
solution $\widehat{w}_1, \ldots, \widehat{w}_n$ is an estimate of the density ratio at the samples
from $P$, i.e., $w_0(X_1), \ldots, w_0(X_n)$. Note that KMM does not estimate the function $w_0$
on $\mathcal{Z}$ but the values on sample points (i.e., transduction).

2.3 M-estimator based on f-divergence Approach

An estimator of the density ratio based on the f-divergence (Ali & Silvey, 1966; Csiszár, 1967)
has been proposed by Nguyen et al. (2008). Let $\varphi : \mathbb{R} \to \mathbb{R}$ be a convex
function; then the f-divergence between $P$ and $Q$ is defined by the integral

$$I(P, Q) = \int \varphi(q/p)\,dP.$$

Setting $\varphi(z) = -\log z$, we obtain the Kullback-Leibler divergence as an example of
f-divergences. Let the conjugate dual function $\psi$ of $\varphi$ be

$$\psi(z) = \sup_{u \in \mathbb{R}} \{ zu - \varphi(u) \} = -\inf_{u \in \mathbb{R}} \{ \varphi(u) - zu \}.$$

When $\varphi$ is a convex function, we also have

$$\varphi(z) = -\inf_{u \in \mathbb{R}} \{ \psi(u) - zu \}. \tag{3}$$

Substituting (3) into the f-divergence, we obtain another expression,

$$I(P, Q) = -\inf_{w} \left[ \int \psi(w)\,dP - \int w\,dQ \right], \tag{4}$$

where the infimum is taken over all measurable functions $w : \mathcal{Z} \to \mathbb{R}$. The
infimum is attained at the function $w$ such that

$$\psi'(w(x)) = \frac{q(x)}{p(x)},$$

where $\psi'$ is the derivative of $\psi$. Approximating (4) with the empirical distributions
$P_n$ and $Q_m$, we obtain the empirical loss function. This estimator is referred to as the
M-estimator of the density ratio. A more practical algorithm for the Kullback-Leibler divergence
has been independently proposed in Sugiyama et al. (2008a).

When an RKHS $\mathcal{H}$ is employed as a statistical model, an estimator is obtained by
minimizing the loss function which approximates (4) over $\mathcal{H}$,

$$\inf_{w} \; \int \psi(w)\,dP_n - \int w\,dQ_m + \frac{\lambda}{2} \|w\|_{\mathcal{H}}^2, \quad w \in \mathcal{H}. \tag{5}$$

The density ratio $w_0$ is estimated by $\psi'(\widehat{w}(x))$, where $\widehat{w}$ is the
minimizer of (5). The regularization term with the regularization parameter $\lambda$ is
introduced to avoid overfitting.

In the RKHS $\mathcal{H}$, the representer theorem (Kimeldorf & Wahba, 1971) is applicable, and
the optimization problem on $\mathcal{H}$ is reduced to a finite dimensional optimization
problem. Statistical convergence properties of the kernel estimator for the Kullback-Leibler
divergence have been investigated in Nguyen et al. (2008) and Sugiyama et al. (2008b).
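
As a worked illustration of the conjugate pair used in (3) and (4), the following computes $\psi$ for the two choices of $\varphi$ that appear in this paper. The derivation is a sketch added for clarity and uses only the definition of the conjugate dual given above.

```latex
% Kullback-Leibler case: \varphi(u) = -\log u for u > 0.
% The supremum of zu + \log u over u > 0 is finite only for z < 0,
% and is attained at u = -1/z.
\psi(z) = \sup_{u > 0} \{ zu + \log u \} = -1 - \log(-z), \qquad z < 0,
\qquad
\psi'(z) = -\frac{1}{z}.
% The optimality condition \psi'(w(x)) = q(x)/p(x) then gives
% w(x) = -p(x)/q(x), and the density ratio is recovered as \psi'(w) = -1/w.

% Squared-loss case: \varphi(u) = u^2/2.
\psi(z) = \sup_{u \in \mathbb{R}} \{ zu - u^2/2 \} = \frac{z^2}{2},
\qquad
\psi'(z) = z.
% Here the minimizer of (4) is the density ratio itself: w(x) = q(x)/p(x).
```

In the squared-loss case the estimator of the density ratio is simply the minimizer $\widehat{w}$, with no further transformation by $\psi'$.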

2.4 Least-squares Approach 

The linear model

$$w(x) = \sum_{i=1}^{b} \alpha_i h_i(x) \tag{6}$$

is assumed for estimation of the density ratio $w_0$, where the coefficients
$\alpha_1, \ldots, \alpha_b$ are the parameters of the model. The basis functions $h_i$,
$i = 1, \ldots, b$, are chosen so that the non-negativity condition $h_i(x) \geq 0$ is satisfied.
A practical choice would be the Gaussian kernel function
$h_i(x) = \exp(-\|x - c_i\|^2 / (2\sigma^2))$ with appropriate kernel center
$c_i \in \mathcal{Z}$ and kernel width $\sigma$ (Sugiyama et al., 2008a).

The unconstrained least-squares importance fitting (uLSIF) (Kanamori et al., 2009) estimates
the parameter $\alpha$ based on the square error:

$$\frac{1}{2} \int (w - w_0)^2\,dP = \frac{1}{2} \int w^2\,dP - \int w\,dQ + \frac{1}{2} \int w_0^2\,dP.$$

The last term in the above expression is a constant and can be safely ignored when minimizing
the square error of the estimator $w$. Therefore, the solution of the following minimization
problem over the linear model,

$$\min_{w} \; \frac{1}{2} \int w^2\,dP_n - \int w\,dQ_m + \lambda \cdot \mathrm{Reg}(\alpha), \tag{7}$$

is expected to approximate the true density ratio $w_0$, where the regularization term
$\mathrm{Reg}(\alpha)$ with the regularization parameter $\lambda$ is introduced to avoid
overfitting. We define the column vector $\alpha = (\alpha_1, \ldots, \alpha_b)^\top$ and the
vector-valued function $h(x) = (h_1(x), \ldots, h_b(x))^\top$. Substituting the linear model (6)
into the objective function of (7), we obtain

$$\min_{\alpha} \; \frac{1}{2} \alpha^\top H \alpha - g^\top \alpha + \lambda \cdot \mathrm{Reg}(\alpha), \tag{8}$$

where $H$ and $g$ are the $b$ by $b$ matrix and the $b$-dimensional vector defined as
$H = \int h h^\top dP_n$ and $g = \int h\,dQ_m$, respectively. Let $\widehat{\alpha}$ be the
minimizer of (8); then the estimator of $w_0$ is given as
$\widehat{w}(x) = \sum_{i=1}^{b} \widehat{\alpha}_i h_i(x)$. There are several ways to impose the
non-negativity condition $w(x) \geq 0$ (Kanamori et al., 2009). Here, the truncation of
$\widehat{w}$ defined as

$$\widehat{w}_+(x) = \max\{\widehat{w}(x), 0\}$$

is used for obtaining a non-negative estimator.

Note that the loss function (4) with $\psi(z) = z^2/2$ is essentially equivalent to the loss of
uLSIF. uLSIF has an advantage in computation over other M-estimators: when
$\mathrm{Reg}(\alpha) = \|\alpha\|^2/2$, the estimator $\widehat{\alpha}$ can be obtained in an
analytic form. As a result, the leave-one-out cross-validation (LOOCV) score can also be computed
in a closed form (Kanamori et al., 2009), which allows us to compute the LOOCV score very
efficiently. LOOCV is an (almost) unbiased estimator of the prediction error and can be used for
determining hyper-parameters such as the regularization parameter $\lambda$ or the Gaussian
kernel width $\sigma$.
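
The analytic form mentioned above can be sketched in a few lines. With $\mathrm{Reg}(\alpha) = \|\alpha\|^2/2$, setting the gradient of (8) to zero gives the linear system $(H + \lambda I)\alpha = g$. The snippet below is a minimal illustration under assumed toy data; the function names, the choice of kernel centers, and the parameter values are hypothetical and not taken from the paper.

```python
import numpy as np

def gaussian_basis(x, centers, sigma):
    """Evaluate h_i(x) = exp(-||x - c_i||^2 / (2 sigma^2)) for all rows of x."""
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def ulsif_fit(x_p, y_q, centers, sigma, lam):
    """Solve (H + lam * I) alpha = g, the stationarity condition of (8)
    when Reg(alpha) = ||alpha||^2 / 2."""
    h_p = gaussian_basis(x_p, centers, sigma)   # n x b, basis on samples from P
    h_q = gaussian_basis(y_q, centers, sigma)   # m x b, basis on samples from Q
    H = h_p.T @ h_p / x_p.shape[0]              # empirical version of int h h^T dP
    g = h_q.mean(axis=0)                        # empirical version of int h dQ
    return np.linalg.solve(H + lam * np.eye(g.shape[0]), g)

def ulsif_predict(x, alpha, centers, sigma):
    """Truncated estimator w_+(x) = max{w(x), 0}."""
    return np.maximum(gaussian_basis(x, centers, sigma) @ alpha, 0.0)

# Toy data (assumption): P = N(0, 1), Q = N(0.5, 1) in one dimension.
rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=(200, 1))
y_q = rng.normal(0.5, 1.0, size=(200, 1))
centers = y_q[:20]                              # assumed choice of kernel centers

alpha = ulsif_fit(x_p, y_q, centers, sigma=1.0, lam=0.1)
w_hat = ulsif_predict(x_p, alpha, centers, sigma=1.0)
```

Because the fit is a single linear solve, repeating it over a grid of $\lambda$ and $\sigma$ values is cheap, which is what makes cross-validation practical here.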



3 Kernel uLSIF 

The purpose of this paper is to show that a kernelized variant of uLSIF (which we refer to as
kernel uLSIF; KuLSIF) has good theoretical properties and is thus useful. In this section, we
formalize the KuLSIF algorithm and briefly show its fundamental properties. Then, in Section 5,
we analyze the computational efficiency of the KuLSIF algorithm from the viewpoint of the
condition number.



3.1 uLSIF on RKHS 

We assume that the model for the density ratio is an RKHS $\mathcal{H}$ endowed with a kernel
function $k$ on $\mathcal{Z} \times \mathcal{Z}$, and we consider the optimization problem (7) on
$\mathcal{H}$. According to (7), the estimator $\widehat{w}$ is obtained as

$$\min_{w} \; \frac{1}{2} \int w^2\,dP_n - \int w\,dQ_m + \frac{\lambda}{2} \|w\|_{\mathcal{H}}^2, \quad \text{s.t.} \quad w \in \mathcal{H}. \tag{9}$$

The regularization term $\frac{\lambda}{2} \|w\|_{\mathcal{H}}^2$ with the regularization
parameter $\lambda$ ($> 0$) is introduced to avoid overfitting. The truncated estimator
$\widehat{w}_+ = \max\{\widehat{w}, 0\}$ may be preferable in practice; the estimation procedure
of $\widehat{w}$ or $\widehat{w}_+$ based on (9) is called KuLSIF.

The following theorem reveals the convergence rate of the estimators $\widehat{w}$ and
$\widehat{w}_+$.

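Theorem 1 (Convergence Rate of KuLSIF). Assume that the domain $\mathcal{Z}$ is compact. Let
$\mathcal{H}$ be an RKHS with the Gaussian kernel. Suppose that $q/p = w_0 \in \mathcal{H}$, and
$\|w_0\|_{\mathcal{H}} < \infty$. Set the regularization parameter $\lambda = \lambda_{n,m}$ so that

$$\lim_{n,m \to \infty} \lambda_{n,m} = 0, \qquad \lambda_{n,m}^{-1} = O((n \wedge m)^{1-\delta}),$$

where $n \wedge m = \min\{n, m\}$ and $\delta$ is an arbitrary number satisfying $0 < \delta < 1$.
Then the estimators $\widehat{w}$ and $\widehat{w}_+$ satisfy

$$\|\widehat{w}_+ - w_0\|_P \leq \|\widehat{w} - w_0\|_P = O_p(\lambda_{n,m}^{1/2}),$$

where $\|\cdot\|_P$ is the $L_2$-norm under the probability $P$.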

Proofs may be found in Appendix A. By choosing small $\delta > 0$, the convergence rate will
get close to the order of $O(1/\sqrt{n \wedge m})$, which is the convergence rate for parametric
models. See Nguyen et al. (2008) and Sugiyama et al. (2008b) for similar convergence analysis
under the Kullback-Leibler divergence.

Remark 1. Although Theorem 1 focuses on the Gaussian kernel, extension to other kernels is
straightforward. Let $\mathcal{Z}$ be a probability space, and $k$ be a kernel function over
$\mathcal{Z} \times \mathcal{Z}$, and suppose $\sup_{x \in \mathcal{Z}} k(x, x) < \infty$.
According to the proof of Theorem 1, we assume that the bracketing entropy
$H_B(\delta, \mathcal{H}_M, P)$ is bounded above by $O((M/\delta)^{\gamma})$, where
$0 < \gamma < 2$ (see the proof in Appendix A for the definition). Then, we obtain

$$\|\widehat{w}_+ - w_0\|_P \leq \|\widehat{w} - w_0\|_P = O_p(\lambda_{n,m}^{1/2}),$$

where $\lambda_{n,m}^{-1} = O((n \wedge m)^{1-\delta})$ with $1 - 2/(2 + \gamma) < \delta < 1$.

3.2 Analytic-form Solution of KuLSIF 

The problem (9) is an infinite dimensional optimization problem if the dimension of
$\mathcal{H}$ is infinite. The representer theorem (Kimeldorf & Wahba, 1971), however, is
applicable to RKHSs, and we immediately have the following theorem.

Theorem 2. Suppose the samples (1) are observed. The estimator $\widehat{w}$ given as the
solution of (9) has the form of

$$\widehat{w}(z) = \sum_{i=1}^{n} \alpha_i k(z, X_i) + \sum_{j=1}^{m} \beta_j k(z, Y_j), \tag{10}$$

where $\alpha_1, \ldots, \alpha_n, \beta_1, \ldots, \beta_m \in \mathbb{R}$.

The theorem follows from a direct application of the original representer theorem, so we omit
its proof. This theorem shows that the estimator $\widehat{w}$ lies in a finite dimensional
subspace of $\mathcal{H}$.

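Furthermore, for KuLSIF (i.e., the squared loss), the parameters in $\widehat{w}(z)$ can be
obtained analytically. Let $K_{11}$, $K_{12}$, $K_{21}$, and $K_{22}$ be the sub-matrices of the
Gram matrix:

$$(K_{11})_{ii'} = k(X_i, X_{i'}), \quad (K_{12})_{ij} = k(X_i, Y_j), \quad K_{21} = K_{12}^\top, \quad (K_{22})_{jj'} = k(Y_j, Y_{j'}),$$

where $i, i' = 1, \ldots, n$ and $j, j' = 1, \ldots, m$. Let
$1_m = (1, \ldots, 1)^\top \in \mathbb{R}^m$ for a positive integer $m$. Then the estimated
parameters $\widehat{\alpha}_i$ and $\widehat{\beta}_j$ are given as follows.

Theorem 3 (Analytic Solution of KuLSIF). Suppose that the regularization parameter $\lambda$ is
strictly positive. Then the estimated parameters in KuLSIF are given as

$$\widehat{\alpha} = (\widehat{\alpha}_1, \ldots, \widehat{\alpha}_n)^\top = -\frac{1}{m\lambda} (K_{11} + n\lambda I_n)^{-1} K_{12} 1_m, \tag{11}$$

$$\widehat{\beta} = (\widehat{\beta}_1, \ldots, \widehat{\beta}_m)^\top = \frac{1}{m\lambda} 1_m, \tag{12}$$

where $I_n$ is the $n$ by $n$ identity matrix.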

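Proof. We start by proving the theorem for the general M-estimator based on f-divergences. We
consider the minimization problem of the loss function

$$\int \psi(w)\,dP_n - \int w\,dQ_m + \frac{\lambda}{2} \|w\|_{\mathcal{H}}^2$$

subject to

$$w = \sum_{j=1}^{n} \alpha_j k(\cdot, X_j) + \sum_{\ell=1}^{m} \beta_\ell k(\cdot, Y_\ell).$$

Suppose $\psi$ is a differentiable convex function. Let $v(\alpha, \beta) \in \mathbb{R}^n$ be a
vector-valued function defined as

$$v(\alpha, \beta)_i = \psi'\!\left( \sum_{j=1}^{n} \alpha_j k(X_i, X_j) + \sum_{\ell=1}^{m} \beta_\ell k(X_i, Y_\ell) \right), \quad i = 1, \ldots, n,$$

where $\psi'$ denotes the derivative of $\psi$. Then, the extremal condition of the loss function
is given as

$$\frac{1}{n} K_{11} v(\alpha, \beta) - \frac{1}{m} K_{12} 1_m + \lambda K_{11} \alpha + \lambda K_{12} \beta = 0, \quad \text{and}$$
$$\frac{1}{n} K_{21} v(\alpha, \beta) - \frac{1}{m} K_{22} 1_m + \lambda K_{22} \beta + \lambda K_{21} \alpha = 0.$$

If $\alpha$ and $\beta$ satisfy the above conditions, they are the optimal solution because the
loss function is convex in $\alpha$ and $\beta$. Substituting $\beta = \frac{1}{m\lambda} 1_m$, we
obtain

$$\frac{1}{n} K_{11} v(\alpha, 1_m/m\lambda) + \lambda K_{11} \alpha = 0, \quad \text{and}$$
$$\frac{1}{n} K_{21} v(\alpha, 1_m/m\lambda) + \lambda K_{21} \alpha = 0.$$

Hence, if the equation

$$\frac{1}{n} v(\alpha, 1_m/m\lambda) + \lambda \alpha = 0 \tag{13}$$

has a solution, it is revealed that $\beta = \frac{1}{m\lambda} 1_m$ is a part of the optimal
solution. For $\psi(z) = z^2/2$, we have

$$v(\alpha, \beta) = K_{11} \alpha + K_{12} \beta,$$

thus, (13) is reduced to

$$(K_{11} + n\lambda I_n)\,\alpha = -\frac{1}{m\lambda} K_{12} 1_m. \tag{14}$$

The coefficient matrix is non-singular. Therefore, the estimator is represented by (11) and
(12). □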
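
The closed form in (11) and (12) can be checked numerically. The following minimal sketch assumes a Gaussian kernel and toy one-dimensional data (both are assumptions for illustration, as are the function names); it computes $\widehat{\alpha}$ and $\widehat{\beta}$ and then evaluates the two extremal conditions from the proof for the squared loss, where $v(\alpha, \beta) = K_{11}\alpha + K_{12}\beta$.

```python
import numpy as np

def gauss_gram(a, b, sigma=1.0):
    """Gram matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def kulsif_fit(x_p, y_q, lam, sigma=1.0):
    """Closed-form KuLSIF coefficients following (11) and (12)."""
    n, m = len(x_p), len(y_q)
    K11 = gauss_gram(x_p, x_p, sigma)
    K12 = gauss_gram(x_p, y_q, sigma)
    rhs = K12 @ np.ones(m)
    alpha = -np.linalg.solve(K11 + n * lam * np.eye(n), rhs) / (m * lam)
    beta = np.ones(m) / (m * lam)
    return alpha, beta

# Toy data (assumption): P = N(0, 1), Q = N(0.5, 1).
rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=(50, 1))
y_q = rng.normal(0.5, 1.0, size=(40, 1))
lam = 0.1
alpha, beta = kulsif_fit(x_p, y_q, lam)

# Evaluate the extremal conditions of the proof at (alpha, beta).
n, m = len(x_p), len(y_q)
K11 = gauss_gram(x_p, x_p)
K12 = gauss_gram(x_p, y_q)
K22 = gauss_gram(y_q, y_q)
v = K11 @ alpha + K12 @ beta                     # v(alpha, beta) for psi(z) = z^2 / 2
grad_alpha = K11 @ v / n - K12 @ np.ones(m) / m + lam * (K11 @ alpha + K12 @ beta)
grad_beta = K12.T @ v / n - K22 @ np.ones(m) / m + lam * (K22 @ beta + K12.T @ alpha)
```

Both `grad_alpha` and `grad_beta` should vanish up to floating-point error, which is the numerical counterpart of the argument that (13) implies optimality.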

Remark 2. As shown in the proof of Theorem 3, the estimate $\widehat{\beta}$ for any
f-divergence (5) is given as (12) (but not (11)) under the condition that Equation (13) has a
solution with respect to $\alpha$.

Eventually, the estimator based on the f-divergence is given by solving the following
optimization problem:

$$\inf_{w} \; \int \psi(w)\,dP_n - \int w\,dQ_m + \frac{\lambda}{2} \|w\|_{\mathcal{H}}^2$$
$$\text{s.t.} \quad w(\cdot) = \sum_{i=1}^{n} \alpha_i k(\cdot, X_i) + \frac{1}{m\lambda} \sum_{j=1}^{m} k(\cdot, Y_j), \quad \alpha_1, \ldots, \alpha_n \in \mathbb{R}. \tag{15}$$

When $\psi(z) = z^2/2$, the problem (15) is reduced to

$$\min_{\alpha} \; \frac{1}{2} \alpha^\top \left( \frac{1}{n} K_{11}^2 + \lambda K_{11} \right) \alpha + \frac{1}{nm\lambda} 1_m^\top K_{21} K_{11} \alpha, \quad \alpha \in \mathbb{R}^n, \tag{16}$$

by ignoring the term independent of the parameter $\alpha$. On the other hand, Theorem 3
guarantees that the parameter $\alpha$ in KuLSIF is obtained as the optimal solution of the
following optimization problem¹:

$$\min_{\alpha} \; \frac{1}{2} \alpha^\top \left( \frac{1}{n} K_{11} + \lambda I_n \right) \alpha + \frac{1}{nm\lambda} 1_m^\top K_{21} \alpha, \quad \alpha \in \mathbb{R}^n. \tag{17}$$

The estimator given by solving the optimization problem (17) is denoted as Reduced-KuLSIF
(R-KuLSIF). Although KuLSIF and R-KuLSIF share the same optimal solution, their loss functions
are different. In a later section, we make clear that R-KuLSIF is preferable to the other
estimators, including KuLSIF, from the viewpoint of numerical computation, especially when the
sample size is large.
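
To make the difference between the two objectives concrete, the sketch below builds the quadratic forms of (16) and (17) on toy data, confirms that the minimizer of (17) also zeroes the gradient of (16), and compares the condition numbers of the two Hessians. The data, kernel width, and value of $\lambda$ are assumptions chosen only for illustration.

```python
import numpy as np

def gauss_gram(a, b, sigma=1.0):
    """Gram matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Toy data (assumption).
rng = np.random.default_rng(1)
x_p = rng.normal(0.0, 1.0, size=(40, 1))
y_q = rng.normal(0.5, 1.0, size=(30, 1))
n, m, lam = len(x_p), len(y_q), 0.1

K11 = gauss_gram(x_p, x_p)
K12 = gauss_gram(x_p, y_q)
ones_m = np.ones(m)

# Hessian and linear term of the KuLSIF objective (16).
hess_kulsif = K11 @ K11 / n + lam * K11
lin_kulsif = K11 @ K12 @ ones_m / (n * m * lam)

# Hessian and linear term of the R-KuLSIF objective (17).
hess_reduced = K11 / n + lam * np.eye(n)
lin_reduced = K12 @ ones_m / (n * m * lam)

# Minimizer of (17): hess_reduced @ alpha + lin_reduced = 0.
alpha = -np.linalg.solve(hess_reduced, lin_reduced)

# The same alpha is a stationary point of (16).
grad_kulsif = hess_kulsif @ alpha + lin_kulsif

cond_kulsif = np.linalg.cond(hess_kulsif)
cond_reduced = np.linalg.cond(hess_reduced)
```

Since every entry of a Gaussian Gram matrix is at most one, the largest eigenvalue of $K_{11}/n$ is at most one, so `cond_reduced` cannot exceed $(1 + \lambda)/\lambda$ regardless of how ill-conditioned $K_{11}$ itself is.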

3.3 Leave-one-out Cross-validation 

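In addition to the solutions $\widehat{\alpha}$ and $\widehat{\beta}$, the leave-one-out
cross-validation (LOOCV) score can also be obtained analytically in KuLSIF. The accuracy of the
KuLSIF estimator $\widehat{w}_+ = \max\{\widehat{w}, 0\}$ is measured by
$\frac{1}{2} \int \widehat{w}_+^2\,dP - \int \widehat{w}_+\,dQ$, which is equal to the square
error of $\widehat{w}_+$ up to a constant term. Then the LOOCV score of $\widehat{w}_+$ under the
square error is defined as

$$\mathrm{LOOCV} = \frac{1}{n \wedge m} \sum_{\ell=1}^{n \wedge m} \left\{ \frac{1}{2} \left( \widehat{w}_+^{(\ell)}(x_\ell) \right)^2 - \widehat{w}_+^{(\ell)}(y_\ell) \right\}, \tag{18}$$

where $\widehat{w}_+^{(\ell)} = \max\{\widehat{w}^{(\ell)}, 0\}$ is the estimator based on the
samples except $x_\ell$ and $y_\ell$. The indices of the removed samples could be different, for
example $x_{\ell_1}$ and $y_{\ell_2}$, but for the sake of simplicity, we suppose that the samples
$x_\ell$ and $y_\ell$ are removed in the computation of LOOCV. Hyper-parameters achieving the
minimum value of LOOCV will be a good choice.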

Thanks to the analytic solutions (11) and (12), the leave-one-out solution $\widehat{w}^{(\ell)}$
can be computed efficiently from $\widehat{w}$ by the use of the Sherman-Woodbury-Morrison
formula (Golub & Van Loan, 1996). The details of the analytic LOOCV expression are deferred to
Appendix B; the derivation follows a similar line to Kanamori et al. (2009), which deals with a
linear model (10). A minor difference is that removing the sample $(x_\ell, y_\ell)$ in KuLSIF
changes the basis functions due to the kernel expression.
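
For reference, the leave-one-out procedure can also be written as a brute-force loop that refits the estimator with the pair $(x_\ell, y_\ell)$ removed and accumulates $\frac{1}{2} \widehat{w}_+(x_\ell)^2 - \widehat{w}_+(y_\ell)$ on the held-out pair. The sketch below is not the analytic expression of Appendix B; it is a slow but simple baseline under assumed toy data, a Gaussian kernel, and hypothetical function names, and could serve to sanity-check a faster implementation.

```python
import numpy as np

def gauss_gram(a, b, sigma):
    """Gram matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def kulsif_fit(x_p, y_q, lam, sigma):
    """Closed-form KuLSIF coefficients following (11) and (12)."""
    n, m = len(x_p), len(y_q)
    K11 = gauss_gram(x_p, x_p, sigma)
    K12 = gauss_gram(x_p, y_q, sigma)
    alpha = -np.linalg.solve(K11 + n * lam * np.eye(n), K12.sum(axis=1)) / (m * lam)
    beta = np.full(m, 1.0 / (m * lam))
    return alpha, beta

def loocv_brute_force(x_p, y_q, lam, sigma):
    """Refit without the l-th pair and score the truncated estimator on it."""
    k = min(len(x_p), len(y_q))
    total = 0.0
    for l in range(k):
        x_rest = np.delete(x_p, l, axis=0)
        y_rest = np.delete(y_q, l, axis=0)
        alpha, beta = kulsif_fit(x_rest, y_rest, lam, sigma)
        held_out = np.vstack([x_p[l], y_q[l]])           # row 0: x_l, row 1: y_l
        w = (gauss_gram(held_out, x_rest, sigma) @ alpha
             + gauss_gram(held_out, y_rest, sigma) @ beta)
        w_plus = np.maximum(w, 0.0)
        total += 0.5 * w_plus[0] ** 2 - w_plus[1]
    return total / k

# Toy data and candidate regularization parameters (assumptions).
rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=(30, 1))
y_q = rng.normal(0.5, 1.0, size=(30, 1))
scores = {lam: loocv_brute_force(x_p, y_q, lam, sigma=1.0) for lam in (0.01, 0.1, 1.0)}
best_lam = min(scores, key=scores.get)
```

Each candidate costs one linear solve per held-out pair here, which is the cost that the Sherman-Woodbury-Morrison update avoids.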

4 Relation between KuLSIF and KMM 

We show the relation between KuLSIF and KMM. 

¹ We used the fact that the solution of $Ax = b$ is given as the minimizer of
$\frac{1}{2} x^\top A x - b^\top x$, when $A$ is positive-semidefinite.

We assume that the true density ratio $w_0 = q/p$ is included in $\mathcal{H}$. As shown in
Section 2, the loss function of KMM on $\mathcal{H}$ is defined as

$$L_{\mathrm{KMM}}(w) = \frac{1}{2} \|\Phi(w)\|_{\mathcal{H}}^2,$$
$$\Phi(w) = \int k(\cdot, x) w(x) P(dx) - \int k(\cdot, y) Q(dy).$$

In the estimation phase, an empirical approximation of $L_{\mathrm{KMM}}$ is optimized in the KMM
algorithm. On the other hand, the (unregularized) loss function of KuLSIF is given by

$$L_{\mathrm{KuLSIF}}(w) = \frac{1}{2} \int w^2\,dP - \int w\,dQ.$$

Both $L_{\mathrm{KMM}}$ and $L_{\mathrm{KuLSIF}}$ are minimized at the true density ratio
$w_0 \in \mathcal{H}$. Although some linear constraints may be introduced in the optimization
phase, we study the optimization problems of $L_{\mathrm{KMM}}$ and $L_{\mathrm{KuLSIF}}$ without
constraints. This is because when the sample size tends to infinity, the optimal solutions of
$L_{\mathrm{KMM}}$ and $L_{\mathrm{KuLSIF}}$ without constraints automatically satisfy the
required constraints such as $\int w\,dP = 1$ and $w \geq 0$.

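We consider the extremal condition of $L_{\mathrm{KuLSIF}}(w)$ at $w_0$. Substituting
$w = w_0 + \delta \cdot v$ ($\delta \in \mathbb{R}$, $v \in \mathcal{H}$) into
$L_{\mathrm{KuLSIF}}(w)$, we have

$$L_{\mathrm{KuLSIF}}(w_0 + \delta v) - L_{\mathrm{KuLSIF}}(w_0) = \delta \left\{ \int w_0 v\,dP - \int v\,dQ \right\} + \frac{\delta^2}{2} \int v^2\,dP.$$

Since $L_{\mathrm{KuLSIF}}(w_0 + \delta v)$ is minimized at $\delta = 0$, the derivative of
$L_{\mathrm{KuLSIF}}(w_0 + \delta v)$ at $\delta = 0$ vanishes, i.e.,

$$\int w_0 v\,dP - \int v\,dQ = 0. \tag{19}$$

The equality (19) holds for arbitrary $v \in \mathcal{H}$. Using the reproducing property of the
kernel function $k$, we can express (19) in terms of $\Phi(w_0)$ as follows:

$$\int w_0 v\,dP - \int v\,dQ = \int w_0(x) \langle k(\cdot, x), v \rangle_{\mathcal{H}} P(dx) - \int \langle k(\cdot, y), v \rangle_{\mathcal{H}} Q(dy)$$
$$= \left\langle \int k(\cdot, x) w_0(x) P(dx) - \int k(\cdot, y) Q(dy), \; v \right\rangle_{\mathcal{H}}$$
$$= \langle \Phi(w_0), v \rangle_{\mathcal{H}} = 0, \quad \forall v \in \mathcal{H}. \tag{20}$$

Therefore, we obtain $\Phi(w_0) = 0$, and we find that $\Phi(w)$ is the Gateaux derivative
(Zeidler, 1986) of $L_{\mathrm{KuLSIF}}$ at $w \in \mathcal{H}$. In summary, let
$DL_{\mathrm{KuLSIF}}$ be the Gateaux derivative of $L_{\mathrm{KuLSIF}}$ over the RKHS
$\mathcal{H}$; then the equality

$$L_{\mathrm{KMM}}(w) = \frac{1}{2} \|DL_{\mathrm{KuLSIF}}(w)\|_{\mathcal{H}}^2 \tag{21}$$

holds. Tsuboi et al. (2008) have pointed out a similar relation for the M-estimator based on the
Kullback-Leibler divergence.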

Now we illustrate the relation between KuLSIF and KMM by showing an analogous optimization
example in the Euclidean space. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a differentiable
function, and consider the optimization problem $\min_x f(x)$. At the optimal solution $x_0$, the
extremal condition $\nabla f(x_0) = 0$ should hold, where $\nabla f$ is the gradient of $f$. Thus,
instead of minimizing $f$, minimization of $\|\nabla f(x)\|^2$ also provides the minimizer of
$f$. This corresponds to the relation between KuLSIF and KMM:

$$\text{KuLSIF} \iff \min_{x} f(x),$$
$$\text{KMM} \iff \min_{x} \frac{1}{2} \|\nabla f(x)\|^2.$$

In other words, in order to find the solution of the equation

$$\Phi(w) = 0, \tag{22}$$

KMM tries to minimize the norm of $\Phi(w)$. The "dual" expression of (22) is given as

$$\langle \Phi(w), v \rangle_{\mathcal{H}} = 0, \quad \forall v \in \mathcal{H}. \tag{23}$$

By "integrating" $\langle \Phi(w), v \rangle_{\mathcal{H}}$, we obtain the loss function
$L_{\mathrm{KuLSIF}}$.

Remark 3. Gretton et al. (2006) have proposed the maximum mean discrepancy (MMD) to
measure the discrepancy between two probabilities $P$ and $Q$. When the constant function 1 is
included in the RKHS $\mathcal{H}$, the MMD between $P$ and $Q$ is equal to
$2 \times L_{\mathrm{KMM}}(1)$. Due to the equality (21), we find that the MMD is also expressed
as $\|DL_{\mathrm{KuLSIF}}(1)\|_{\mathcal{H}}^2$, that is, the norm of the derivative of
$L_{\mathrm{KuLSIF}}$ at $1 \in \mathcal{H}$. This quantity will be related to the discrepancy
between the constant function 1 and the true density ratio $w_0 = q/p$.

Remark 4. It is straightforward to extend the above relation to the general f-divergence
approach. The loss function of the M-estimator (Nguyen et al., 2008) is given as

$$L_{\psi}(w) = \int \psi(w)\,dP - \int w\,dQ.$$

Then, the loss function of the KMM-type may be defined as

$$L_{\psi\text{-KMM}}(w) = \frac{1}{2} \|DL_{\psi}(w)\|_{\mathcal{H}}^2,$$

where

$$DL_{\psi}(w) = \int k(\cdot, x) \psi'(w(x)) P(dx) - \int k(\cdot, y) Q(dy).$$

We can confirm that $L_{\psi}(w)$ and $L_{\psi\text{-KMM}}(w)$ share the minimizer. If there
exists $w_{\psi} \in \mathcal{H}$ such that $w_0 = \psi'(w_{\psi})$, the optimal solution is given
by $w_{\psi}$.



5 Condition Number Analysis for Density Ratio Estimation 

We have elucidated basic properties of the KuLSIF algorithm. In this section, we study the
condition number of KuLSIF and other density ratio estimators in order to investigate
computational properties. This is the main contribution of this paper.

5.1 Condition Number in Numerical Analysis and Optimization

Condition numbers play crucial roles in numerical analysis and optimization (Demmel, 1997;
Luenberger & Ye, 2008; Sankar et al., 2006), as explained in this section.

Let $A$ be a symmetric positive definite matrix. The condition number of $A$ is defined as
$\lambda_{\max}/\lambda_{\min}$ ($\geq 1$), where $\lambda_{\max}$ and $\lambda_{\min}$ are the
maximal and minimal eigenvalues of $A$, respectively. The condition number of $A$ is denoted by
$\kappa(A)$. In general, the condition number for a matrix which may not be symmetric is defined
through the singular values. The above definition is, however, enough for our purpose.

In numerical analysis, the condition number governs the round-off error of the solution of a
linear equation $Ax = b$. A matrix $A$ with a large condition number will lead to a large upper
bound on the relative error of the solution $x$. More precisely, in the perturbed linear equation
$(A + \delta A)(x + \delta x) = b + \delta b$, the relative error of the solution is bounded as
follows (Demmel, 1997):

$$\frac{\|\delta x\|}{\|x\|} \leq \frac{\kappa(A)}{1 - \kappa(A) \|\delta A\| / \|A\|} \left( \frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|b\|} \right).$$

Hence, a smaller condition number is preferable in numerical computation.
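
The perturbation bound can be observed numerically. The sketch below perturbs a randomly generated symmetric positive definite system (an assumed toy example, not data from the paper) and compares the observed relative error with the bound, using the spectral norm for matrices and the Euclidean norm for vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy system: A symmetric positive definite, small symmetric perturbation.
M = rng.normal(size=(5, 5))
A = M @ M.T + 0.1 * np.eye(5)
b = rng.normal(size=5)
dA = 1e-6 * rng.normal(size=(5, 5))
dA = (dA + dA.T) / 2.0
db = 1e-6 * rng.normal(size=5)

x = np.linalg.solve(A, b)                 # solution of A x = b
x_pert = np.linalg.solve(A + dA, b + db)  # solution of the perturbed system

kappa = np.linalg.cond(A)                 # 2-norm condition number of A
rel_A = np.linalg.norm(dA, 2) / np.linalg.norm(A, 2)
rel_b = np.linalg.norm(db) / np.linalg.norm(b)

rel_err = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
bound = kappa / (1.0 - kappa * rel_A) * (rel_A + rel_b)
```

The bound is meaningful only when $\kappa(A)\|\delta A\|/\|A\| < 1$, which holds for the tiny perturbations used here; `rel_err` should not exceed `bound`.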

In optimization problems, the condition number determines the convergence rate of optimization
algorithms. Let us consider a minimization problem $\min_x f(x)$, $x \in \mathbb{R}^n$, where
$f : \mathbb{R}^n \to \mathbb{R}$ is a differentiable function, and let $x_0$ be a local optimal
solution. We consider an iterative algorithm which generates a sequence $\{x_i\}_{i=1}^{\infty}$.
In various iterative algorithms, the sequence is generated as

$$x_{i+1} = x_i - S_i^{-1} \nabla f(x_i), \quad i = 1, 2, \ldots, \tag{24}$$

where $S_i$ is an approximation of the Hessian matrix of $f$ at $x_0$, i.e., $\nabla^2 f(x_0)$.
Then under a mild assumption, the sequence $\{x_i\}_{i=1}^{\infty}$ converges to $x_0$. Numerical
techniques such as scaling and pre-conditioning are also incorporated in the above form with a
certain choice of $S_i$. According
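to Section 10.1 in Luenberger and Ye (2008), the convergence rate of such iterative algorithms
is given as

$$\|x_k - x_0\| = O\!\left( \prod_{i=1}^{k} \frac{\kappa_i - 1}{\kappa_i + 1} \right),$$

where $\kappa_i$ is the condition number of $S_i^{-1/2} (\nabla^2 f(x_0)) S_i^{-1/2}$. Thus, the
convergence rate of the sequence $x_k$ is slow if $\kappa_i$ is large. More critically, when
$\{\kappa_i\}_{i=1}^{\infty}$ does not converge to one, the sequence $\{x_i\}_{i=1}^{\infty}$ does
not converge to $x_0$ at a super-linear rate.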
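
The effect of the condition number on the iteration (24) can be seen on a quadratic objective. The sketch below makes a simplifying assumption that is not part of the paper: it fixes $S_i = cI$ with $c = (\lambda_{\max} + \lambda_{\min})/2$ for every step. For a quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$ this choice contracts the error by the factor $(\kappa - 1)/(\kappa + 1)$ per step, where $\kappa = \kappa(A)$; this is a standard property of the fixed-step gradient method.

```python
import numpy as np

def run_iteration(A, b, x_start, num_steps):
    """Iterate x <- x - S^{-1} grad f(x) with the fixed choice S = c * I (assumption)."""
    eigvals = np.linalg.eigvalsh(A)
    c = 0.5 * (eigvals[0] + eigvals[-1])
    x_opt = np.linalg.solve(A, b)
    x = x_start.copy()
    errors = []
    for _ in range(num_steps):
        grad = A @ x - b                  # gradient of (1/2) x^T A x - b^T x
        x = x - grad / c
        errors.append(np.linalg.norm(x - x_opt))
    return np.array(errors)

b = np.array([1.0, 1.0])
x_start = np.array([5.0, -3.0])
num_steps = 50

A_well = np.diag([1.0, 10.0])             # condition number 10
A_ill = np.diag([1.0, 1000.0])            # condition number 1000

errors_well = run_iteration(A_well, b, x_start, num_steps)
errors_ill = run_iteration(A_ill, b, x_start, num_steps)

ratio_well = (10.0 - 1.0) / (10.0 + 1.0)
ratio_ill = (1000.0 - 1.0) / (1000.0 + 1.0)
```

After 50 steps the well-conditioned problem has reduced its error by a factor of roughly `ratio_well ** 50`, while the ill-conditioned one has barely moved, because `ratio_ill` is close to one.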

When the condition number of the Hessian matrix $\nabla^2 f(x_0)$ is large, there is a trade-off
between the numerical accuracy and the convergence rate in optimization problems. Let us
illustrate the trade-off using a few examples. When the Newton method is employed, $S_k$ is given
as $\nabla^2 f(x_k)$. Because of the continuity of $\nabla^2 f$, the condition number of
$S_k = \nabla^2 f(x_k)$ would be large if $\kappa(\nabla^2 f(x_0))$ is large. Then the numerical
computation of $S_k^{-1} \nabla f(x_k)$ becomes unstable. When quasi-Newton methods such as the
BFGS method or the DFP method (Luenberger & Ye, 2008) are employed, $S_k$ or $S_k^{-1}$ is
successively estimated based on the information of the gradient. If $\kappa(\nabla^2 f(x_0))$ is
large, $\kappa(S_k)$ is also likely to be large, and thus, the numerical computation of
$S_k^{-1} \nabla f(x_k)$ is not reliable, even when $S_k^{-1}$ is successively updated in the
quasi-Newton methods. The round-off error caused by nearly singular Hessian matrices
significantly affects the accuracy of the quasi-Newton methods. As a result, it may not be
guaranteed that $S_k^{-1} \nabla f(x_k)$ is a preferable descent direction of the objective
function $f$.

In optimization problems with large condition numbers, the numerical computation tends
to be unreliable. To avoid numerical instability, the Hessian matrix is often modified so that
$S_k$ has a moderate condition number. For example, the optimization toolbox in MATLAB
implements a gradient descent method in its function fminunc. The default method in fminunc
is the BFGS method with update through the Cholesky factorization of $S_k$ (not $S_k^{-1}$). Even
if the positive definiteness of $S_k$ is violated by the round-off error, the Cholesky
factorization immediately detects the negativity of eigenvalues and the positive definiteness of
$S_k$ is recovered by adding a correction term. When the modified Cholesky factorization is used,
the condition number of $S_k$ is guaranteed to be bounded above by some constant $C$. See Moré &
Sorensen (1984) for details.

The trade-off between numerical accuracy and convergence rate is summarized by the following
equality:

$$\min_{S : \kappa(S) \leq C} \kappa\!\left( S^{-1/2} (\nabla^2 f(x_0)) S^{-1/2} \right) = \max\left\{ \frac{\kappa(\nabla^2 f(x_0))}{C}, \; 1 \right\}. \tag{25}$$

The proof of (25) may be found in Appendix C. We suppose that a symmetric positive definite
matrix $S_k$ satisfying $\kappa(S_k) \leq C$ is used in the iterative algorithm (24). If
$\kappa(\nabla^2 f(x_0))$ is large, the right-hand side of (25) will be greater than one. Hence,
the convergence rate will be slow. That is, the quasi-Newton method with a modified Hessian $S_k$
such that $\kappa(S_k) \leq C$ may not achieve a super-linear convergence rate. Even though some
scaling or pre-conditioning technique is available, it is preferable that the condition number of
the original problem is kept as small as possible.
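
One way to see that the value $\max\{\kappa(\nabla^2 f(x_0))/C, 1\}$ in (25) is attainable is to build $S$ from the eigen-decomposition of the Hessian and raise its small eigenvalues until $\kappa(S) \leq C$. The sketch below does this for an assumed diagonal stand-in for the Hessian; the construction is an illustration added here and is not claimed to be the argument used in Appendix C.

```python
import numpy as np

def clipped_preconditioner(A, C):
    """Return S sharing the eigenvectors of A, with small eigenvalues raised
    to lambda_max / C so that kappa(S) <= C, together with S^{-1/2}."""
    eigvals, eigvecs = np.linalg.eigh(A)                 # eigenvalues in ascending order
    clipped = np.maximum(eigvals, eigvals[-1] / C)
    S = (eigvecs * clipped) @ eigvecs.T
    S_inv_sqrt = (eigvecs / np.sqrt(clipped)) @ eigvecs.T
    return S, S_inv_sqrt

A = np.diag([1.0, 10.0, 100.0, 1000.0])   # assumed stand-in for the Hessian, kappa = 1000
C = 20.0                                   # assumed bound on kappa(S)

S, S_inv_sqrt = clipped_preconditioner(A, C)
preconditioned = S_inv_sqrt @ A @ S_inv_sqrt

kappa_A = np.linalg.cond(A)
kappa_S = np.linalg.cond(S)
kappa_pre = np.linalg.cond(preconditioned)
```

Here `kappa_S` equals the allowed bound 20 and `kappa_pre` equals 1000 / 20 = 50, so a bounded preconditioner removes only a factor of $C$ from the condition number of the original problem.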

5.2 Condition Number Analysis of KuLSIF and KMM 

Let us consider the optimization problems in KuLSIF and KMM on an RKHS $\mathcal{H}$ endowed
with a kernel function $k$ over a set $\mathcal{Z}$. Given samples (1), the optimization problems
of KuLSIF and KMM are defined as

$$\text{(KuLSIF)} \quad \min_{w} \; \frac{1}{2} \int w^2\,dP_n - \int w\,dQ_m + \frac{\lambda}{2} \|w\|_{\mathcal{H}}^2, \quad w \in \mathcal{H},$$
$$\text{(KMM)} \quad \min_{w} \; \frac{1}{2} \left\| \widehat{\Phi}(w) + \lambda w \right\|_{\mathcal{H}}^2, \quad w \in \mathcal{H},$$

where

$$\widehat{\Phi}(w) = \int k(\cdot, x) w(x) P_n(dx) - \int k(\cdot, y) Q_m(dy).$$

Here, $\widehat{\Phi}(w) + \lambda w$ is the Gateaux derivative of the loss function for KuLSIF
including the regularization term. In the original KMM method, the density ratio values on the
samples $X_1, \ldots, X_n$ are optimized (Huang et al., 2007), i.e., transduction. Here, we
consider its inductive variant, i.e., estimating the function $w_0$ on $\mathcal{Z}$ using the
loss function of KMM. According to Theorem 3, the optimal solution of (KuLSIF) is given in the
form $w = \sum_{i=1}^{n} \alpha_i k(\cdot, X_i) + \frac{1}{m\lambda} \sum_{j=1}^{m} k(\cdot, Y_j)$;
note that the optimal solution of (KMM) is also given in the same form. Thus, the variables to
be optimized in (KuLSIF) and (KMM) are $\alpha_1, \ldots, \alpha_n$.

We investigate the numerical efficiency of (KuLSIF) and (KMM). When we solve the
minimization problem $\min_x f(x)$, it is not recommended to minimize the norm of the gradient,
$\min_x \|\nabla f(x)\|^2$, since the problem $\min_x \|\nabla f(x)\|^2$ generally has a larger condition number than
$\min_x f(x)$ (Luenberger & Ye, 2008). For example, let $f$ be the convex quadratic function
defined as $f(x) = \frac{1}{2}x^\top A x - b^\top x$ with a positive-definite matrix $A$. Then the condition number
of the Hessian matrix is equal to $\kappa(A)$. On the other hand, the Hessian matrix of the function
$\|\nabla f(x)\|^2 = \|Ax - b\|^2$ has the condition number $\kappa(A^2) = \kappa(A)^2$; that is, the condition number is squared and
thus becomes larger. Below, we show that the same is true of KuLSIF and KMM.
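The squaring effect in the quadratic example is easy to reproduce. The snippet below is a small illustration on a randomly generated positive-definite matrix; the matrix size and the random seed are arbitrary choices made here, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T + 0.1 * np.eye(5)      # symmetric positive definite

hess_f = A                          # Hessian of f(x) = x^T A x / 2 - b^T x
hess_g = 2.0 * A @ A                # Hessian of ||grad f(x)||^2 = ||A x - b||^2

kappa_f = np.linalg.cond(hess_f)
kappa_g = np.linalg.cond(hess_g)
print("kappa of Hessian of f:          ", kappa_f)
print("kappa of Hessian of ||grad f||^2:", kappa_g)
print("kappa_f squared:                ", kappa_f**2)
```

The constant factor 2 in the second Hessian does not affect the condition number, so the two printed values in the last two lines agree up to rounding.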
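The Hessian matrices of the objective functions of KuLSIF and KMM are given as

\[
H_{\mathrm{KuLSIF}} = \frac{1}{n}K_{11}^2 + \lambda K_{11}, \tag{26}
\]
\[
H_{\mathrm{KMM}} = \frac{1}{n^2}K_{11}^3 + \frac{2\lambda}{n}K_{11}^2 + \lambda^2 K_{11}. \tag{27}
\]

$H_{\mathrm{KuLSIF}}$ is derived from (16), and $H_{\mathrm{KMM}}$ is given by direct computation based on (KMM).
Then, we obtain

\[
\kappa(H_{\mathrm{KuLSIF}}) = \kappa(K_{11})\,\kappa\Bigl(\frac{1}{n}K_{11} + \lambda I_n\Bigr),
\qquad
\kappa(H_{\mathrm{KMM}}) = \kappa(K_{11})\,\kappa\Bigl(\frac{1}{n}K_{11} + \lambda I_n\Bigr)^{2}.
\]

Since the condition number is larger than or equal to one, the inequality

\[
\kappa(H_{\mathrm{KuLSIF}}) \le \kappa(H_{\mathrm{KMM}})
\]

holds. This implies that the convergence rate of KuLSIF will be faster than that of KMM when
an iterative optimization algorithm is used to minimize each loss function.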
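The two factorizations of the condition numbers can be verified on a small Gram matrix. This is a sketch under assumptions: the Gaussian kernel is taken in the form $\exp(-\|x-x'\|^2/(2\sigma^2))$, the samples are standard normal in ten dimensions, and the helper names `gaussian_gram` and `kulsif_kmm_hessians` are hypothetical.

```python
import numpy as np


def gaussian_gram(X, sigma):
    """Gram matrix of the Gaussian kernel exp(-||x - x'||^2 / (2 sigma^2)) (assumed form)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))


def kulsif_kmm_hessians(K, lam):
    """Return (H_KuLSIF, H_KMM) as written in Eqs. (26) and (27)."""
    n = K.shape[0]
    K2 = K @ K
    H_kulsif = K2 / n + lam * K
    H_kmm = (K2 @ K) / n**2 + (2.0 * lam / n) * K2 + lam**2 * K
    return H_kulsif, H_kmm


rng = np.random.default_rng(0)
n = 30
X = rng.standard_normal((n, 10))
K = gaussian_gram(X, sigma=2.0)
lam = n ** -0.9
H_kulsif, H_kmm = kulsif_kmm_hessians(K, lam)

kappa = np.linalg.cond
base = kappa(K / n + lam * np.eye(n))
print("kappa(H_KuLSIF):", kappa(H_kulsif), " kappa(K11)*base:   ", kappa(K) * base)
print("kappa(H_KMM):   ", kappa(H_kmm), " kappa(K11)*base^2: ", kappa(K) * base**2)
```

The products match because $K_{11}$ and $\frac{1}{n}K_{11} + \lambda I_n$ share eigenvectors with the same ordering of eigenvalues, so their extreme eigenvalues multiply.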

According to Remark 4, we expect that the condition number of the M-estimator based on $\psi$
is smaller than that of KMM based on $L_{\psi\text{-KMM}}$. Let the Hessian matrices at the optimal solution $\hat w$
be $H_{\psi\text{-div}}$ for $\psi$ and $H_{\psi\text{-KMM}}$ for $L_{\psi\text{-KMM}}$; then some calculation provides

\[
H_{\psi\text{-div}} = K_{11}^{1/2}\Bigl(\frac{1}{n}K_{11}^{1/2}D_{\psi,\hat w}K_{11}^{1/2} + \lambda I_n\Bigr)K_{11}^{1/2},
\]
\[
H_{\psi\text{-KMM}} = K_{11}^{1/2}\Bigl(\frac{1}{n}K_{11}^{1/2}D_{\psi,\hat w}K_{11}^{1/2} + \lambda I_n\Bigr)^{2}K_{11}^{1/2},
\]

where $D_{\psi,w}$ is the $n$ by $n$ diagonal matrix defined as

\[
D_{\psi,w} = \mathrm{diag}\bigl(\psi''(w(X_1)),\dots,\psi''(w(X_n))\bigr), \tag{28}
\]

and $\psi''$ denotes the second-order derivative of $\psi$. Hence, using the inequality $\kappa(AB) \le \kappa(A)\kappa(B)$
(Horn & Johnson, 1985), we have

\[
\kappa(H_{\psi\text{-div}}) \le \kappa(K_{11})\,\kappa\Bigl(\frac{1}{n}K_{11}^{1/2}D_{\psi,\hat w}K_{11}^{1/2} + \lambda I_n\Bigr),
\]
\[
\kappa(H_{\psi\text{-KMM}}) \le \kappa(K_{11})\,\kappa\Bigl(\frac{1}{n}K_{11}^{1/2}D_{\psi,\hat w}K_{11}^{1/2} + \lambda I_n\Bigr)^{2}.
\]

From the viewpoint of these naive upper bounds on the condition numbers, the M-estimator based on $\psi$
will be preferable to KMM with $L_{\psi\text{-KMM}}$.

5.3 Condition Number Analysis of M-Estimators

(K)uLSIF is an example of the M-estimators with the squared loss. Here, we study the condition
number of the Hessian matrix associated with the minimization problem in the $f$-divergence
approach, and show that KuLSIF is optimal among all M-estimators based on $f$-divergences.
More specifically, we give a min-max evaluation (Section 5.3.1) and a probabilistic evaluation
(Section 5.3.2) of the condition number.

5.3.1 Min-max Evaluation 

We assume that a universal RKHS $\mathcal{H}$ (Steinwart, 2001) endowed with a kernel function $k$ on
a compact set $\mathcal{Z}$ is used for estimation of $w_0$. The M-estimator based on the $f$-divergence is
obtained by solving the problem (15). The Hessian matrix of the loss function at the optimal
solution $\hat w$ is equal to

\[
\frac{1}{n}K_{11}D_{\psi,\hat w}K_{11} + \lambda K_{11}, \tag{29}
\]

where $D_{\psi,\hat w}$ is the diagonal matrix defined in Eq. (28). The condition number of the Hessian
matrix is denoted by

\[
\kappa_0(D_{\psi,\hat w}) = \kappa\Bigl(\frac{1}{n}K_{11}D_{\psi,\hat w}K_{11} + \lambda K_{11}\Bigr).
\]

In KuLSIF, we find $\psi'' = 1$, and thus the condition number is equal to $\kappa_0(I_n)$. We analyze the
relation between $\kappa_0(I_n)$ and $\kappa_0(D_{\psi,w})$.

Theorem 4 (Min-max Evaluation). Suppose that $\mathcal{H}$ is a universal RKHS, and that $K_{11}$ is
non-singular. Then,

\[
\inf_{\psi:\,\psi''(1)=1}\ \sup_{w\in\mathcal{H}}\ \kappa_0(D_{\psi,w}) = \kappa_0(I_n) \tag{30}
\]

holds. Here the infimum is taken over all convex second-order continuously differentiable
functions $\psi$ such that $\psi''(1) = 1$.

The proof is deferred to Appendix D. When the constraint $\psi''(1) = c$ is imposed with some
$c > 0$, the optimal function is given as $\psi(z) = cz^2/2$ in the min-max sense. Practically, the value
of $\psi''(1)$ determines the balance between the fit to the training samples and the regularization
term. Theorem 4 guarantees that KuLSIF minimizes the worst-case condition number; this
follows from the fact that the condition number of KuLSIF does not depend on the optimal
solution. Since both sides of (30) depend on the samples $X_1,\dots,X_n$, KuLSIF achieves the
min-max solution in terms of the condition number for each observation.

5.3.2 Probabilistic Evaluation 

Next, we study a probabilistic evaluation of the condition number. As shown in the min-max
evaluation, the Hessian matrix is given as

\[
H = \frac{1}{n}K_{11}D_{\psi,\hat w}K_{11} + \lambda K_{11},
\]

where the diagonal elements of $D_{\psi,\hat w}$ are equal to $\psi''(\hat w(X_1)),\dots,\psi''(\hat w(X_n))$. The estimator $\hat w$
is given as the minimizer of (15). Let us define the random variable $T_n$ as

\[
T_n = \max_{1\le i\le n}\psi''(\hat w(X_i)),
\]

and let $F_n$ be the distribution function of $T_n$. Since $\psi$ is convex, $T_n$ is a non-negative random variable.

Below, we first compute the distribution of the condition number $\kappa(H)$. Then we investigate
the relation between the function $\psi$ and the distribution of the condition number $\kappa(H)$. We need
to study the eigenvalues and the condition numbers of random matrices. For the Wishart
distribution, the probability distribution of condition numbers has been investigated by Edelman
(1988) and Edelman and Sutton (2005). Recently, the condition number of matrices perturbed by
additive Gaussian noise has been investigated under the name of smoothed analysis (Sankar et al.,
2006; Spielman & Teng, 2004; Tao & Vu, 2007). The randomness involved in the matrix $H$ defined
above is, however, different from that in existing works.

Theorem 5 (Probabilistic Evaluation). Let H be a RKHS endowed with a kernel function k on 
Z satisfying the following condition: there exists £ > such that 

^/£ < k{x,x') < I, yx,x'GZ. 

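Assume that the Gram matrix $K_{11}$ is almost surely positive definite in terms of the probability
measure $P$. Suppose that there exist sequences $s_n$ and $t_n$ such that

\[
\lim_{n\to\infty} s_n = \infty, \qquad \lim_{n\to\infty} F_n(s_n) = 0, \qquad \lim_{n\to\infty} F_n(t_n) = 1, \tag{31}
\]

and that there exists $M > 0$ such that $E[\psi''(\hat w(X_i))] < M$ holds for large sample sizes $n$ and $m$.
Suppose that $\lambda = \lambda_{n,m}$ satisfies $\lim_{n\to\infty}\lambda_{n,m} < \infty$. Then, for any small $\nu > 0$, we have

\[
\lim_{n\to\infty}\Pr\Bigl(s_n^{1-\nu} \le \kappa(H) \le \kappa(K_{11})\Bigl(1 + \frac{t_n}{\lambda}\Bigr)\Bigr) = 1. \tag{32}
\]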

The proof is deferred to Appendix E. 
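To see where the upper bound in (32) comes from, the following short derivation may help. It is only a sketch and not the argument of Appendix E. It assumes that the kernel satisfies $k(x,x) \le 1$, so that $\lambda_{\max}(K_{11}) \le \mathrm{tr}(K_{11}) \le n$, and it uses $0 \preceq D_{\psi,\hat w} \preceq T_n I_n$, which follows from the convexity of $\psi$ and the definition of $T_n$.

```latex
\begin{align*}
\lambda_{\max}(H) &\le \frac{T_n}{n}\,\lambda_{\max}(K_{11})^2 + \lambda\,\lambda_{\max}(K_{11}),\\
\lambda_{\min}(H) &\ge \lambda\,\lambda_{\min}(K_{11}),\\
\kappa(H) = \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)}
  &\le \kappa(K_{11})\Bigl(1 + \frac{T_n\,\lambda_{\max}(K_{11})}{n\lambda}\Bigr)
  \le \kappa(K_{11})\Bigl(1 + \frac{T_n}{\lambda}\Bigr).
\end{align*}
```

Since $T_n \le t_n$ holds with probability $F_n(t_n) \to 1$, the deterministic bound above turns into the high-probability upper bound in (32). The lower bound requires a separate argument.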

Remark 5. The Gaussian kernel on a compact set meets the condition of Theorem 5 under a
mild assumption on the probability $P$. If the distribution $P$ of the samples $X_1,\dots,X_n$ is absolutely
continuous with respect to the Lebesgue measure, the Gram matrix of the Gaussian kernel is
almost surely positive definite, because $K_{11}$ is positive definite if $X_i \ne X_j$ for $i \ne j$.

When $\psi$ is the quadratic function $\psi(z) = z^2/2$, the distribution function $F_n$ is given by $F_n(t) =
\mathbf{1}[t \ge 1]$, where $\mathbf{1}[\cdot]$ is the indicator function. Hence, there does not exist a sequence $s_n$ as required
in Theorem 5. The upper bound is, however, still valid. That is, by choosing $t_n = 1$, the
upper bound of $\kappa(H)$ with $\psi(z) = z^2/2$ is asymptotically given as $\kappa(K_{11})(1 + \lambda^{-1})$. On the
other hand, in the M-estimator with the Kullback-Leibler divergence (Nguyen et al., 2008), the
function $\psi$ is defined as $\psi(z) = -1 - \log(-z)$, $z < 0$, and thus $\psi''(z) = 1/z^2$ holds. Hence,
$T_n = \max_{1\le i\le n}(\hat w(X_i))^{-2}$ is expected to be of larger than constant order, and thus $t_n$
would diverge to infinity. This simple analysis indicates that KuLSIF is preferable
to the M-estimator with the Kullback-Leibler divergence in the sense of computational efficiency
and stability.

We derive an approximation of the inequality in (32). The target of the estimator $\hat w$ is the function
$w$ such that $q(x)/p(x) = \psi'(w(x))$ holds. Thus, we expect that the condition number of
$\frac{1}{n}K_{11}D_{\psi,\hat w}K_{11} + \lambda K_{11}$ is approximated by that of $\frac{1}{n}K_{11}D_{\psi,w}K_{11} + \lambda K_{11}$. The proof of Theorem
5 is valid even in the case that the random variable $T_n$ is defined by a fixed function $w \in \mathcal{H}$. The
condition number of the Hessian matrix at a fixed function $w \in \mathcal{H}$ is considered in the proposition
below.

Proposition 1 (Approximated Bound). Suppose that the kernel function $k$ and the regularization parameter
$\lambda$ satisfy the same conditions as in Theorem 5. For a function $w \in \mathcal{H}$, let $F$ be the distribution
function of $\psi''(w(X))$, and suppose that the expectation of $\psi''(w(X))$ is finite. Let $G$ be $1 - F$,
and suppose that there exists a real number $U > 0$ such that $G(t)$ has the inverse function $G^{-1}$
for $t > U$. Let the random matrix $H_w$ be

\[
H_w = \frac{1}{n}K_{11}D_{\psi,w}K_{11} + \lambda K_{11}.
\]

Then, for any small $\eta > 0$ and any small $\nu > 0$, we have

\[
\lim_{n\to\infty}\Pr\Bigl(\bigl\{G^{-1}(1/n^{1-\eta})\bigr\}^{1-\nu} \le \kappa(H_w) \le \kappa(K_{11})\bigl(1 + \lambda^{-1}G^{-1}(1/n^{1+\eta})\bigr)\Bigr) = 1.
\]

Proof. Note that $F_n(t)$ in Theorem 5 is equal to $(F(t))^n$, since $\psi''(w(X_i))$, $i = 1,\dots,n$, are
independently and identically distributed from $F$. The condition number $\kappa(H_w)$ satisfies Eq. (32)
with $F_n = F^n$.

As shown in Figure 1, the function $G^{-1}$ is decreasing. Let $s_n = G^{-1}(1/n^{1-\eta})$; then
$s_n \to \infty$ holds when $n$ tends to infinity. Thus, we have

\[
F_n(s_n) = F(s_n)^n = (1 - G(s_n))^n = \Bigl(1 - \frac{1}{n^{1-\eta}}\Bigr)^{n} \to 0, \qquad n\to\infty.
\]

On the other hand, let $t_n = G^{-1}(1/n^{1+\eta})$; then we have

\[
F_n(t_n) = (1 - G(t_n))^n = \Bigl(1 - \frac{1}{n^{1+\eta}}\Bigr)^{n} \to 1, \qquad n\to\infty.
\]

Substituting $s_n$ and $t_n$ into the inequality in (32), we obtain the result. □
Remark 6. Proposition 1 implies that for large $n$, the inequality

\[
\bigl\{G^{-1}(1/n^{1-\eta})\bigr\}^{1-\nu} \le \kappa(H_w) \le \kappa(K_{11})\bigl(1 + \lambda^{-1}G^{-1}(1/n^{1+\eta})\bigr) \tag{33}
\]

holds with high probability. In KuLSIF, the function $\psi$ is given as $\psi(z) = z^2/2$, and the corresponding
distribution function of each diagonal element of $D_{\psi,w}$ is given by $F_{\mathrm{KuLSIF}}(d) = \mathbf{1}[d \ge 1]$,
and thus $G_{\mathrm{KuLSIF}}(d) = 1 - F_{\mathrm{KuLSIF}}(d) = \mathbf{1}[d < 1]$. In all M-estimators except KuLSIF, the
diagonal elements of $D_{\psi,w}$ can take various positive values. We regard the diagonal elements of
$D_{\psi,w}$ as a typical realization of random variables with the distribution function $F(d)$. When the
distribution function $F$ is close to $F_{\mathrm{KuLSIF}}$, the function $G = 1 - F$ is also close to $G_{\mathrm{KuLSIF}}$.
Then, $G^{-1}$ will take small values as illustrated in Figure 1. As a result, we can expect that the
condition number of KuLSIF is smaller than that of the other M-estimators. In a later section,
we further investigate this issue through numerical experiments.

Figure 1: If the function $G_1(d)$ is closer to $G_{\mathrm{KuLSIF}}(d)$ ($= 0$) than $G_2(d)$ for large $d$, then $G_1^{-1}(z)$
takes a smaller value than $G_2^{-1}(z)$ for small $z$.



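Example 1. Let $F_\gamma(d)$ be

\[
F_\gamma(d) = \begin{cases} 0, & 0 \le d < 1,\\ 1 - d^{-\gamma}, & 1 \le d. \end{cases}
\]

Suppose that $F_\gamma$ is the distribution function of $\psi''(w(X)) = \psi''(\varphi'(q(X)/p(X)))$. Note that the
distribution function $F_{\mathrm{KuLSIF}}(d) = \mathbf{1}[d \ge 1]$ is represented as $\mathbf{1}[d \ge 1] = \lim_{\gamma\to\infty}F_\gamma(d)$ except at
$d = 1$. Then, $G_\gamma(d) = 1 - F_\gamma(d)$ is equal to

\[
G_\gamma(d) = \begin{cases} 1, & 0 \le d < 1,\\ d^{-\gamma}, & 1 \le d. \end{cases}
\]

For small $z > 0$, the inverse function $G_\gamma^{-1}(z)$ is given as

\[
G_\gamma^{-1}(z) = z^{-1/\gamma}.
\]

Hence, substituting into (33), for sufficiently small $\eta$ we have

\[
n^{(1-\eta)(1-\nu)/\gamma} \le \kappa(H_w) \le \kappa(K_{11})\bigl(1 + \lambda^{-1}n^{(1+\eta)/\gamma}\bigr).
\]

Both the upper and lower bounds in the above inequality are monotone decreasing with respect to $\gamma$.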
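The behavior described in Example 1 can be explored by simulation. The sketch below draws the diagonal elements from $F_\gamma$ by inverse-transform sampling, $d = u^{-1/\gamma}$ with $u$ uniform on $(0,1]$, and compares the condition number of $\frac{1}{n}K_{11}\,\mathrm{diag}(d)\,K_{11} + \lambda K_{11}$ with the KuLSIF case $d_i = 1$. The Gaussian kernel form, the kernel width, the sample size, and the function names are assumptions made here for illustration.

```python
import numpy as np


def sample_d(gamma, n, rng):
    """Inverse-transform samples from F_gamma(d) = 1 - d**(-gamma), d >= 1."""
    u = 1.0 - rng.random(n)           # uniform on (0, 1], avoids division by zero
    return u ** (-1.0 / gamma)


def cond_h_rnd(K, d, lam):
    """Condition number of (1/n) K diag(d) K + lam K."""
    n = K.shape[0]
    H = K @ (d[:, None] * K) / n + lam * K
    return np.linalg.cond(H)


rng = np.random.default_rng(0)
n = 40
X = rng.standard_normal((n, 10))
sq = np.sum(X**2, axis=1)
d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
K = np.exp(-d2 / (2.0 * 2.0**2))      # Gaussian kernel with width 2 (assumed form)
lam = n ** -0.9

kappa_kulsif = cond_h_rnd(K, np.ones(n), lam)   # psi'' = 1
for gamma in (2.0, 5.0, 10.0):
    runs = [cond_h_rnd(K, sample_d(gamma, n, rng), lam) for _ in range(200)]
    print(f"gamma={gamma:4.1f}  mean kappa={np.mean(runs):.3e}")
print(f"KuLSIF (d_i = 1)  kappa={kappa_kulsif:.3e}")
```

As $\gamma$ grows, the sampled $d_i$ concentrate near one, so the printed averages can be compared against the KuLSIF value to see how closely they approach it.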
Example 2. Let $F_\gamma(d)$ be

FJd) = Kt^, d>0.

The distribution function $F_{\mathrm{KuLSIF}}(d) = \mathbf{1}[d \ge 1]$ is represented as $\mathbf{1}[d \ge 1] = \lim_{\gamma\to\infty}F_\gamma(d)$
except at $d = 1$. Then, $G_\gamma(d) = 1 - F_\gamma(d)$ is equal to

GJd) = :; ci>0. 

For small z, the inverse function G~^{z) is given as 

G-Hz) = l + -log^. 
' 7 z 

Hence for small r], the inequality (33) will lead the following: 



7 - ^ - ^ A7 

The upper and lower bounds in the above inequality are monotone decreasing with respect to 7. 

6 Reduction of Condition Numbers in KuLSIF 

The condition number in the optimization problem of KuLSIF is given as $\kappa(H_{\mathrm{KuLSIF}}) =
\kappa(\frac{1}{n}K_{11}^2 + \lambda K_{11})$, and that of the original KMM method is equal to $\kappa(K_{11})$, which is approximately
derived from (2). On the other hand, the Hessian matrix of R-KuLSIF is equal to

\[
H_{\mathrm{R\text{-}KuLSIF}} = \frac{1}{n}K_{11} + \lambda I_n. \tag{34}
\]

See (17) for the loss function of R-KuLSIF. Due to the equality

\[
\kappa(H_{\mathrm{KuLSIF}}) = \kappa(K_{11})\,\kappa(H_{\mathrm{R\text{-}KuLSIF}}),
\]

we have

\[
\kappa(H_{\mathrm{R\text{-}KuLSIF}}) \le \kappa(H_{\mathrm{KuLSIF}}).
\]

Moreover, it is easy to see that

\[
\kappa(H_{\mathrm{R\text{-}KuLSIF}}) \le \kappa(K_{11}).
\]

These inequalities imply that R-KuLSIF is preferable to KuLSIF and KMM in terms
of convergence speed and numerical stability, as explained in Section 5.1.
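The inequalities above follow directly from the spectrum of $K_{11}$, which the following sketch makes explicit: if $\mu_{\max}$ and $\mu_{\min}$ are the extreme eigenvalues of $K_{11}$, then $\kappa(H_{\mathrm{R\text{-}KuLSIF}}) = (\mu_{\max}/n + \lambda)/(\mu_{\min}/n + \lambda)$, which cannot exceed $\mu_{\max}/\mu_{\min}$ when $\lambda \ge 0$. The data, the kernel form, and the function name `kappas_from_spectrum` are assumptions for illustration only.

```python
import numpy as np


def kappas_from_spectrum(mu, lam):
    """Condition numbers of K11, H_R-KuLSIF and H_KuLSIF from the eigenvalues mu of K11."""
    n = mu.size
    hi, lo = mu.max(), mu.min()
    kappa_K = hi / lo
    kappa_R = (hi / n + lam) / (lo / n + lam)   # H_R-KuLSIF = K11/n + lam*I, Eq. (34)
    kappa_Ku = kappa_K * kappa_R                # kappa(H_KuLSIF) = kappa(K11) * kappa(H_R-KuLSIF)
    return kappa_K, kappa_R, kappa_Ku


rng = np.random.default_rng(0)
n = 50
X = rng.standard_normal((n, 10))
sq = np.sum(X**2, axis=1)
d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
K = np.exp(-d2 / (2.0 * 4.0**2))      # Gaussian kernel with width 4 (assumed form)
lam = n ** -0.9

mu = np.linalg.eigvalsh(K)
kappa_K, kappa_R, kappa_Ku = kappas_from_spectrum(mu, lam)
print(f"kappa(K11)={kappa_K:.3e}  kappa(H_R-KuLSIF)={kappa_R:.3e}  kappa(H_KuLSIF)={kappa_Ku:.3e}")
```

The identity term $\lambda I_n$ lifts the smallest eigenvalue away from zero, which is why the R-KuLSIF value stays moderate even when $K_{11}$ itself is badly conditioned.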

In this section, we study whether reduction of condition numbers is possible in the general
$f$-divergence approach. We do not consider scaling of the parameter (Luenberger & Ye, 2008),
but other types of transformation of loss functions in order to reduce the condition number. Our
conclusion is that among all $f$-divergence approaches, the condition number is reducible only
in KuLSIF. Thus the reduction of condition numbers by R-KuLSIF is a special property, which
makes R-KuLSIF particularly attractive in practical use.

We elucidate the reason why the condition number of KuLSIF can be reduced from
$\kappa(H_{\mathrm{KuLSIF}})$ to $\kappa(H_{\mathrm{R\text{-}KuLSIF}})$. As explained in Remark 2, in the $f$-divergence approach, the
optimal solution of $\beta$ is equal to $\mathbf{1}_m/(m\lambda)$. Then, as shown in the proof of Theorem 3, the
gradient of the loss function with respect to $\alpha$ is equal to

\[
g_\psi(\alpha) = \frac{1}{n}K_{11}\,v(\alpha, \mathbf{1}_m/(m\lambda)) + \lambda K_{11}\alpha,
\]

where the function $v$ depends on $\psi$. On the other hand, the gradient of the loss function in
(17) is equal to $K_{11}^{-1}g_\psi(\alpha)$ with $\psi(z) = z^2/2$. This fact implies that in KuLSIF, there exists a
non-singular matrix $C \in \mathbb{R}^{n\times n}$, which is independent of $\alpha$, such that $Cg_\psi(\alpha)$ is identical to the
gradient of a function $F(\alpha)$. If the condition number of the Hessian matrix of $F(\alpha)$ does not
exceed $\kappa(H_{\mathrm{KuLSIF}})$, it will be numerically more advantageous to use $F(\alpha)$ as the loss function
than that of KuLSIF.

Suppose that the $\mathbb{R}^n$-valued function $Cg_\psi(\alpha)$ can be represented as the gradient of a function
$F$, that is, $\nabla F = Cg_\psi$. Then, the function $Cg_\psi$ is called integrable (Nakahara, 2003). What
we study in this section is to find $\psi$ such that there exists a non-identity matrix $C$ for which
$Cg_\psi(\alpha)$ is integrable. According to Nakahara (2003), the necessary and sufficient condition for
integrability is that the Jacobian matrix of $Cg_\psi(\alpha)$ is symmetric.

The Jacobian matrix of $Cg_\psi(\alpha)$ is equal to

\[
\frac{1}{n}CK_{11}D_{\psi,\alpha}K_{11} + \lambda CK_{11},
\]

where $D_{\psi,\alpha}$ is the diagonal matrix whose diagonal elements are given as

\[
\psi''\Bigl(\sum_{i=1}^{n}\alpha_i k(X_\ell, X_i) + \frac{1}{m\lambda}\sum_{j=1}^{m}k(X_\ell, Y_j)\Bigr), \qquad \ell = 1,\dots,n.
\]

Let $R$ be the $n$ by $n$ matrix $CK_{11}$; then the Jacobian matrix is represented as

\[
M_{\psi,R}(\alpha) = \frac{1}{n}RD_{\psi,\alpha}K_{11} + \lambda R.
\]

Theorem 6. Let $c$ be a constant value in $\mathbb{R}$, and let the function $\psi$ be second-order continuously
differentiable. Suppose that the Gram matrix $K_{11}$ is non-singular, and that $K_{11}$ does not have
a zero element. If there exists a non-singular matrix $R \ne cK_{11}$ such that $M_{\psi,R}(\alpha)$ is symmetric
for any $\alpha \in \mathbb{R}^n$, then $\psi''$ is a constant function.

The proof may be found in Appendix F. Theorem 6 guarantees that the condition number of
the loss function is reducible only when $\psi$ is a quadratic function. Here, multiplying the gradient
by a matrix $C$, which is independent of $\alpha$, is allowed as a transformation of the loss function. For
other functions $\psi$, the gradient $Cg_\psi(\alpha)$ cannot be integrable unless $C = cI_n$, $c \in \mathbb{R}$.
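The symmetry condition in Theorem 6 can be illustrated for one specific choice of $R$. The sketch below takes $R = I_n$ (that is, $C = K_{11}^{-1}$, the transformation that yields R-KuLSIF) and checks whether $M_{\psi,R} = \frac{1}{n}RDK_{11} + \lambda R$ is symmetric for a constant and for a non-constant diagonal $D$. It covers a single $R$ and therefore does not verify the theorem itself; the data, the kernel form, and the helper names are assumptions.

```python
import numpy as np


def jacobian(R, d, K, lam):
    """M_{psi,R} = (1/n) R D K + lam R with D = diag(d)."""
    n = K.shape[0]
    return R @ (d[:, None] * K) / n + lam * R


def is_symmetric(M, tol=1e-10):
    """True when M equals its transpose up to an absolute tolerance."""
    return bool(np.allclose(M, M.T, rtol=0.0, atol=tol))


rng = np.random.default_rng(0)
n = 20
X = rng.standard_normal((n, 10))
sq = np.sum(X**2, axis=1)
d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
K = np.exp(-d2 / (2.0 * 2.0**2))      # Gaussian Gram matrix, no zero entries
lam = n ** -0.9
I_n = np.eye(n)

d_const = np.ones(n)                   # psi'' constant (quadratic psi)
d_var = np.linspace(0.5, 2.0, n)       # psi'' varying across samples

print("R = I,   constant D:", is_symmetric(jacobian(I_n, d_const, K, lam)))
print("R = I,   varying D: ", is_symmetric(jacobian(I_n, d_var, K, lam)))
print("R = K11, varying D: ", is_symmetric(jacobian(K, d_var, K, lam)))
```

With $R = I_n$ the Jacobian is symmetric only when $DK_{11} = K_{11}D$, which for a Gram matrix without zero entries forces all diagonal entries of $D$ to coincide. The choice $R = K_{11}$ is always symmetric, which is why multiples of $K_{11}$ are excluded in the theorem.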

Remark 7. We summarize the theoretical results on condition numbers. Let $H_{\psi\text{-div}}$ be the
Hessian matrix (29) of the M-estimator. Then, the following inequalities hold:

\[
\kappa(H_{\mathrm{R\text{-}KuLSIF}}) \le \kappa(K_{11}) \le \kappa(H_{\mathrm{KuLSIF}}) \le \kappa(H_{\mathrm{KMM}}),
\]
\[
\kappa(H_{\mathrm{KuLSIF}}) = \sup_{w\in\mathcal{H}}\kappa(H_{\mathrm{KuLSIF}}) \le \sup_{w\in\mathcal{H}}\kappa(H_{\psi\text{-div}}).
\]

Remember that $K_{11}$ is the Hessian matrix of the original (transductive) KMM method, and
$H_{\mathrm{KMM}}$ is that of its inductive variant. Based on the probabilistic evaluation, the inequality

\[
\kappa(H_{\mathrm{KuLSIF}}) \le \kappa(H_{\psi\text{-div}})
\]

will also hold with high probability. Let $H_{\psi\text{-KMM}}$ be the Hessian matrix of the loss function
$L_{\psi\text{-KMM}}$ in Remark 4. Then, we conjecture that

\[
\kappa(H_{\psi\text{-div}}) \le \kappa(H_{\psi\text{-KMM}})
\]

holds in some sense as an extension of the relation between KuLSIF and the inductive variant
of KMM. Consequently, R-KuLSIF will be advantageous in numerical computation.

7 Simulation Results 

In this section, we experimentally investigate the behavior of the condition numbers. In the
inductive variant of the KMM estimator, the Hessian matrix is given by $H_{\mathrm{KMM}}$ defined in (27). In
the M-estimator based on the $f$-divergence, the Hessian matrix involved in the optimization problem
is given as

\[
H = \frac{1}{n}K_{11}D_{\psi,\hat w}K_{11} + \lambda K_{11}.
\]

For the Kullback-Leibler divergence, we have $\varphi(z) = -\log z$ and $\psi(z) = -1 - \log(-z)$, $z < 0$,
and thus $\psi'(z) = -1/z$ and $\psi''(z) = 1/z^2$ hold for $z < 0$. If the optimal solution provides the
true density ratio $w_0$, we obtain $\psi''(\hat w(x)) = \psi''(\varphi'(w_0(x))) = w_0(x)^2$. Thus, the Hessian
matrix is given as

\[
H_{\mathrm{KL}} = \frac{1}{n}K_{11}\,\mathrm{diag}\bigl(w_0(X_1)^2,\dots,w_0(X_n)^2\bigr)K_{11} + \lambda K_{11}.
\]

On the other hand, in KuLSIF, the Hessian matrix is given by $H_{\mathrm{KuLSIF}}$ defined in (26), and the
Hessian matrix of R-KuLSIF, $H_{\mathrm{R\text{-}KuLSIF}}$, is shown in (34). In the examples of Section 5.3.2, we
considered the condition number of a random matrix

\[
H_{\mathrm{RND}} = \frac{1}{n}K_{11}\,\mathrm{diag}(d_1,\dots,d_n)\,K_{11} + \lambda K_{11} \in \mathbb{R}^{n\times n}.
\]

We use $F_\gamma(d)$ defined in Example 1 with various $\gamma$ as the distribution function of $d_1,\dots,d_n$.
The condition numbers of the Hessian matrices $H_{\mathrm{KMM}}$, $H_{\mathrm{KL}}$, $H_{\mathrm{KuLSIF}}$, $H_{\mathrm{R\text{-}KuLSIF}}$, and $H_{\mathrm{RND}}$ are
numerically compared. In addition, the condition number of $K_{11}$ is also computed. In the
original transductive KMM estimator defined by (2), the condition number of the loss function
is equal to $\kappa(K_{11})$. Thus, the convergence rate of numerical optimization in KMM would be
approximately governed by $\kappa(K_{11})$; we need to take the constraints in (2) into account to derive
a more accurate convergence rate of the original KMM.

The probability densities of $P$ and $Q$ are both set to be the normal distribution on the
10-dimensional Euclidean space with the unit variance-covariance matrix $I_{10}$. The mean vectors
of $P$ and $Q$ are set to $0 \times \mathbf{1}_{10}$ and $\mu \times \mathbf{1}_{10}$ with $\mu = 0.2$ or $\mu = 0.5$, respectively. Note that
the mean value $\mu$ affects only $\kappa(H_{\mathrm{KL}})$. The true density ratio $w_0$ is determined by $P$ and $Q$.
In the kernel-based estimators, we use the Gaussian kernel with width $\sigma = 2$ or $\sigma = 4$. Note
that $\sigma = 4$ is close to the median of the distance between samples $\|X_i - X_j\|$; using the median
distance as the kernel width is a popular heuristic (Schölkopf & Smola, 2002). The sample
size from $P$ is equal to that from $Q$, that is, $n = m$. The regularization parameter $\lambda$ is set to
$\lambda_{n,m} = 1/(n \wedge m)^{0.9}$, which meets the assumption in Theorem 1.
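The setup described above can be reproduced for a single random draw as follows. This is a sketch under stated assumptions rather than the authors' experiment code: the Gaussian kernel is taken as $\exp(-\|x-x'\|^2/(2\sigma^2))$, the true density ratio for $P = N(0, I)$ and $Q = N(\mu\mathbf{1}, I)$ is computed in closed form as $w_0(x) = \exp(\mu\,\mathbf{1}^\top x - d\mu^2/2)$, and the function name `simulate_condition_numbers` is hypothetical.

```python
import numpy as np


def simulate_condition_numbers(n, mu, sigma, rng, dim=10):
    """Condition numbers of the Hessians compared in this section, for one draw of X_1..X_n."""
    X = rng.standard_normal((n, dim))                      # X_i ~ P = N(0, I_dim)
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    K = np.exp(-d2 / (2.0 * sigma**2))                     # assumed Gaussian kernel form
    lam = n ** -0.9                                        # lambda = 1/(n ^ m)^0.9 with n = m
    w0 = np.exp(mu * X.sum(axis=1) - 0.5 * dim * mu**2)    # q(x)/p(x) for Q = N(mu*1, I)
    R = K / n + lam * np.eye(n)                            # H_R-KuLSIF, Eq. (34)
    hessians = {
        "K11": K,
        "R-KuLSIF": R,
        "KuLSIF": K @ R,                                   # Eq. (26)
        "KMM": K @ R @ R,                                  # Eq. (27)
        "KL": K @ ((w0**2)[:, None] * K) / n + lam * K,
    }
    kappa = {name: np.linalg.cond(H) for name, H in hessians.items()}
    median_dist = np.median(np.sqrt(d2[np.triu_indices(n, k=1)]))
    return kappa, median_dist


kappa, median_dist = simulate_condition_numbers(50, 0.5, 2.0, np.random.default_rng(0))
for name, value in kappa.items():
    print(f"{name:9s} {value:.2e}")
print(f"median pairwise distance: {median_dist:.2f}")
```

Averaging the returned values over many independent draws, and repeating for each combination of $n$, $\mu$, and $\sigma$, gives quantities of the kind summarized in Table 1.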

Table 1 shows the experimental results. In each setup, the samples $X_1,\dots,X_n$ and the diagonal
elements $d_1,\dots,d_n$ are randomly generated and the condition number is computed. The table
shows the average of the condition numbers over 1000 runs. As shown in Table 1, the condition
number of R-KuLSIF is much smaller than those of the other methods in all cases. Thus, it is expected
that in optimization, the convergence speed of R-KuLSIF is faster than that of the other methods and
that R-KuLSIF is robust against numerical degeneracy. It is worthwhile to point out
that $\kappa(H_{\mathrm{R\text{-}KuLSIF}})$ is smaller than $\kappa(K_{11})$. This is because the identity matrix in $H_{\mathrm{R\text{-}KuLSIF}}$
prevents the smallest eigenvalue from becoming extremely small. The value of $\kappa(H_{\mathrm{RND}})$
decreases as $\gamma$ becomes large, and seems to converge to $\kappa(H_{\mathrm{KuLSIF}})$. This result agrees with
the considerations in Remark 6 and Example 1.

Table 2 shows the average number of iterations and the average computation time for solving
the optimization problems over 50 runs. The probability densities of $P$ and $Q$ are the same as
above, and the mean vector of $Q$ is given as $0.5 \times \mathbf{1}_{10}$. The numbers of samples are
set to $(n, m) = (1000, 1000)$, $(4000, 4000)$, or $(6000, 6000)$, and the regularization parameter is
$\lambda = 1/(n \wedge m)^{0.9}$. Note that $n$ is equal to the number of parameters to be optimized.
R-KuLSIF, KuLSIF, the inductive variant of KMM (KMM), and the M-estimator with the Kullback-Leibler
divergence (KL) are compared. In addition, the computation time of solving the linear equation
(14) is also shown as R-KuLSIF(direct). The kernel parameter $\sigma$ is determined based on the
median of $\|X_i - X_j\|$. To solve the optimization problems in the M-estimators and KMM, we
used the BFGS method implemented in the optim function in R (R Development Core Team,
2009), and for R-KuLSIF(direct) we used the solve function. The results show that the number
of iterations in optimization is highly correlated with the condition number of the Hessian
matrices in Table 1. Although the practical computational time would depend on various issues
such as stopping rules, our theoretical results were shown to be in good agreement with the
empirical results. Thus, R-KuLSIF would be a stable and computationally efficient density-ratio
estimator. We observe that numerical optimization methods such as the quasi-Newton
method are competitive with numerical algorithms for solving linear equations (such as the
LU or Cholesky methods), especially when the sample size or the number of parameters is
large. Thus, the results obtained in this paper would be useful in large sample cases, which are common
in practical applications.

8 Conclusions 

We considered the problem of estimating the ratio of two probability densities and investigated
theoretical properties of the kernel least-squares estimator called KuLSIF. We studied the condition
number of Hessian matrices, and showed that KuLSIF has a smaller condition number than
the other methods. Since the condition number determines the convergence rate of optimization
and the numerical stability, KuLSIF will have preferable numerical properties compared with the other
methods. We further showed that R-KuLSIF, which is an alternative formulation of KuLSIF,
possesses an even smaller condition number.

Density ratio estimation could provide new approaches to various machine learning problems
including covariate shift adaptation (Huang et al., 2007; Sugiyama et al., 2008a; Kanamori et al.,
2009; Bickel et al., 2009), outlier detection (Hido et al., 2008), and feature selection (Suzuki
et al., 2008). Based on the theoretical guidance given in this paper, we will develop practical
algorithms for a wide range of applications in future work.

Table 1: Condition numbers of each Hessian matrix. Columns, in order: sample size n; K_11; H_R-KuLSIF; H_KuLSIF; H_KMM; H_KL (mu = 0.2); H_KL (mu = 0.5); H_RND (gamma = 2); H_RND (gamma = 5); H_RND (gamma = 10). Several entries and row labels did not survive; fragments list the surviving values in their original order.

n     K_11     R-KuLSIF  KuLSIF   KMM      KL(0.2)  KL(0.5)  RND(2)   RND(5)   RND(10)
20    1.6e+01  3.8e+00   6.4e+01  2.7e+02  9.0e+01  1.4e+03  1.1e+02  7.4e+01  6.9e+01
50    (fragment) 7.1e+01 8.1e+00 5.9e+02 5.1e+03 7.6e+02 4.8e+03 7.1e+02 6.5e+02
?     (fragment) 3.8e+04
?     (fragment) 2.9e+03 4.4e+01
?     (fragment) 5.7e+06 1.6e+05 5.8e+05 2.5e+05 1.6e+05 1.4e+05
?     5.9e+03  5.8e+01   3.4e+05  2.0e+07  4.2e+05  1.5e+06  6.8e+05  4.3e+05  3.8e+05
500   1.0e+04  7.3e+01   7.5e+05  5.5e+07  9.2e+05  3.1e+06  1.5e+06  9.4e+05  8.3e+05

?     (fragment) 5.7e+03
50    4.2e+03  2.8e+01   1.2e+05  3.4e+06  1.6e+05  7.7e+05  2.3e+05  1.5e+05  1.3e+05
100   3.1e+04  5.5e+01   1.7e+06  9.6e+07  2.4e+06  1.2e+07  3.4e+06  2.2e+06  1.9e+06
200   2.6e+05  1.1e+02   2.8e+07  3.1e+09  3.9e+07  2.1e+08  5.6e+07  3.5e+07  3.2e+07
300   1.0e+06  1.6e+02   1.7e+08  2.7e+10  2.3e+08  1.2e+09  3.3e+08  2.1e+08  1.9e+08
400   3.0e+06  2.1e+02   6.3e+08  1.4e+11  8.7e+08  5.0e+09  1.3e+09  7.9e+08  7.0e+08
500   6.5e+06  2.7e+02   1.7e+09  4.6e+11  2.4e+09  1.3e+10  3.4e+09  2.2e+09  1.9e+09

A Proof of Theorem 1 

Let us define the bracketing entropy of a set of functions. For a distribution function $P$, define
the $L_2$ metric

\[
\|g\|_P = \Bigl(\int |g|^2\,dP\Bigr)^{1/2},
\]

and let $L_2(P)$ be the metric space defined by this distance. For any fixed $\delta > 0$, a covering for a
function class $S$ using the metric $L_2(P)$ is a collection of functions which allow $S$ to be covered
using $L_2(P)$ balls of radius $\delta$ centered at these functions. Let $N_B(\delta, S, P)$ be the smallest value
of $N$ for which there exist pairs of functions $\{(g_j^L, g_j^U) \in L_2(P)\times L_2(P) \mid j = 1,\dots,N\}$ such
that $\|g_j^L - g_j^U\|_P \le \delta$, and such that for each $s \in S$, there exists $j$ such that $g_j^L \le s \le g_j^U$. Then,
$H_B(\delta, S, P) = \log N_B(\delta, S, P)$ is called the bracketing entropy of $S$ (van de Geer, 2000).

Table 2: Averages of the computation time and the number of iterations in the BFGS method
over 50 runs. (CPU: Xeon X5482, 3.20GHz, Memory: 32GB, OS: Linux 2.6.18)

                    n = 1000, m = 1000       n = 4000, m = 4000       n = 6000, m = 6000
Estimator           time (sec.)  iterations  time (sec.)  iterations  time (sec.)  iterations
R-KuLSIF                   1.44       23.02        34.94       29.98        71.69       30.74
KuLSIF                     2.25       38.36        53.93       48.76       107.79       47.32
KMM                       51.83      453.68       591.44      400.74      1091.69      373.08
KL                        27.63      329.06      1180.72      634.32      2718.89      669.20
R-KuLSIF(direct)           0.46           -        28.85           -        87.06           -

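Let $\mathcal{H}$ be the RKHS endowed with the Gaussian kernel, $k(x,y) = \exp(-\|x-y\|^2/(2\sigma^2))$. The norm
and inner product on $\mathcal{H}$ are denoted by $\|\cdot\|_{\mathcal{H}}$ and $\langle\cdot,\cdot\rangle_{\mathcal{H}}$, respectively. Let $\|\cdot\|_\infty$ be the infinity
norm. For $w \in \mathcal{H}$, we have $\|w\|_P \le \|w\|_\infty \le \|w\|_{\mathcal{H}}$, because for any $x \in \mathcal{Z}$, the inequalities

\[
|w(x)| = |\langle w, k(\cdot,x)\rangle_{\mathcal{H}}| \le \|w\|_{\mathcal{H}}\,\sup_x\sqrt{k(x,x)} = \|w\|_{\mathcal{H}}
\]

hold. The set $\mathcal{Z}$, which is the domain of functions in $\mathcal{H}$, is assumed to be compact. Let
$\mathcal{G} = \{v^2 \mid v \in \mathcal{H}\}$. Let $\mathcal{H}_M$ and $\mathcal{G}_M$ be

\[
\mathcal{H}_M = \{v \in \mathcal{H} \mid \|v\|_{\mathcal{H}} < M\},
\]
\[
\mathcal{G}_M = \{v^2 \mid v \in \mathcal{H}_{\sqrt{M}}\} = \{g \in \mathcal{G} \mid J(g) < M\}, \tag{35}
\]

where $J(g)$ is a measure of complexity defined as

\[
J(g) = \inf\{\|v\|_{\mathcal{H}}^2 \mid v \in \mathcal{H},\ v^2 = g\}.
\]

It is straightforward to verify the second equality of (35). According to Zhou (2002), the
bracketing entropy of $\mathcal{H}_M$ satisfies, for infinitesimally small $\gamma > 0$, the condition

\[
H_B(\delta, \mathcal{H}_M, P) = O\Bigl(\Bigl(\frac{M}{\delta}\Bigr)^{\gamma}\Bigr). \tag{36}
\]

More precisely, Zhou (2002) proved that the entropy number with the supremum norm is
bounded above by $O((M/\delta)^\gamma)$. In addition, the bracketing entropy $H_B(\delta, \mathcal{H}_M, P)$ is bounded
above by the entropy number with the supremum norm due to Lemma 2.1 in van de Geer (2000).
The following proposition is crucial to prove the convergence property of KuLSIF.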

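Proposition 2 (Lemma 5.14 in van de Geer (2000)). Let a map $I(g)$ be a measure of complexity
of $g \in \mathcal{G}$, where $I$ is a non-negative functional on $\mathcal{G}$ and $I(g_0) < \infty$. Then, we define $\mathcal{G}_M =
\{g \in \mathcal{G} \mid I(g) \le M\}$ satisfying $\mathcal{G} = \bigcup_{M\ge 1}\mathcal{G}_M$. Suppose that there exist $c_0 > 0$ and $0 < \gamma < 2$
such that

\[
\sup_{g\in\mathcal{G}_M}\|g - g_0\|_P \le c_0 M, \qquad
\sup_{\substack{g\in\mathcal{G}_M\\ \|g-g_0\|_P\le\delta}}\|g - g_0\|_\infty \le c_0 M, \qquad \text{for all } \delta > 0,
\]

and that $H_B(\delta, \mathcal{G}_M, P) = O((M/\delta)^\gamma)$. Then, we have

\[
\sup_{g\in\mathcal{G}}\frac{\bigl|\int (g - g_0)\,d(P - P_n)\bigr|}{D(g)} = O_p(1),
\]

where $D(g)$ is defined as

\[
D(g) = \frac{\|g - g_0\|_P^{1-\gamma/2}\,I(g)^{\gamma/2}}{\sqrt{n}} \vee \frac{I(g)}{n^{2/(2+\gamma)}},
\]

and $a \vee b$ denotes $\max\{a, b\}$.

We use Proposition 2 to derive upper bounds of $\int(\hat w - w_0)\,d(Q - Q_m)$ and $\int(\hat w^2 - w_0^2)\,d(P - P_n)$.

Lemma 1. The bracketing entropy of $\mathcal{G}_M$ is bounded above as

\[
H_B(\delta, \mathcal{G}_M, P) = O\Bigl(\Bigl(\frac{M}{\delta}\Bigr)^{\gamma}\Bigr)
\]

for any small $\gamma > 0$.

Proof. Let $v_1^L, v_1^U, v_2^L, v_2^U, \dots, v_N^L, v_N^U \in L_2(P)$ be coverings of $\mathcal{H}_{\sqrt{M}}$ in the sense of bracketing,
such that $\|v_i^L - v_i^U\|_P \le \delta$ holds for $i = 1,\dots,N$. We can choose these functions such that
$\|v_i^{L,U}\|_\infty \le \sqrt{M}$ is satisfied for all $i = 1,\dots,N$, since for any $v \in \mathcal{H}_{\sqrt{M}}$, the inequality
$\|v\|_\infty \le \|v\|_{\mathcal{H}} < \sqrt{M}$ holds. For example, replace $v_i^{L,U}$ with $\min\{\sqrt{M}, \max\{-\sqrt{M}, v_i^{L,U}\}\} \in L_2(P)$.
Let $\bar v_i^L$ and $\bar v_i^U$ be

\[
\bar v_i^L(x) = \begin{cases} (v_i^L(x))^2, & v_i^L(x) \ge 0,\\ (v_i^U(x))^2, & v_i^U(x) \le 0,\\ 0, & v_i^L(x) < 0 < v_i^U(x), \end{cases}
\qquad
\bar v_i^U(x) = \max\{(v_i^L(x))^2, (v_i^U(x))^2\},
\]

for $i = 1,\dots,N$. Then, $\bar v_i^L \le \bar v_i^U$ holds. Moreover, for any $v$ satisfying $v_i^L \le v \le v_i^U$, we
have $\bar v_i^L \le v^2 \le \bar v_i^U$. By definition, we also have

\[
0 \le \bar v_i^U(x) - \bar v_i^L(x) \le \max\{|v_i^U(x)^2 - v_i^L(x)^2|,\ |v_i^U(x) - v_i^L(x)|^2\}
\]
\[
\le (|v_i^U(x)| + |v_i^L(x)|)\cdot|v_i^U(x) - v_i^L(x)| \le 2\sqrt{M}\,|v_i^U(x) - v_i^L(x)|,
\]

and thus, $\|\bar v_i^U - \bar v_i^L\|_P \le 2\sqrt{M}\,\|v_i^U - v_i^L\|_P$ holds. Due to (36), we obtain

\[
H_B(2\sqrt{M}\delta, \mathcal{G}_M, P) \le H_B(\delta, \mathcal{H}_{\sqrt{M}}, P) = O\Bigl(\Bigl(\frac{\sqrt{M}}{\delta}\Bigr)^{\gamma}\Bigr).
\]

Hence, $H_B(\delta, \mathcal{G}_M, P) = O((M/\delta)^\gamma)$ holds. □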

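Lemma 2. Assume the condition of Theorem 1. Then, for the KuLSIF estimator $\hat w$, we have

\[
\Bigl|\int(\hat w - w_0)\,d(Q - Q_m)\Bigr| = O_p\Bigl(\frac{\|w_0 - \hat w\|_P^{1-\gamma/2}\,\|\hat w\|_{\mathcal{H}}^{\gamma/2}}{\sqrt{m}} \vee \frac{\|\hat w\|_{\mathcal{H}}}{m^{2/(2+\gamma)}}\Bigr),
\]
\[
\Bigl|\int(\hat w^2 - w_0^2)\,d(P - P_n)\Bigr| = O_p\Bigl(\frac{\|\hat w - w_0\|_P^{1-\gamma/2}\,(1 + \|\hat w\|_{\mathcal{H}})^{1+\gamma/2}}{\sqrt{n}} \vee \frac{\|\hat w\|_{\mathcal{H}}^2}{n^{2/(2+\gamma)}}\Bigr),
\]

where $\gamma > 0$ is an infinitesimally small value.

Proof. There exists $c_0 > 0$ such that

\[
\sup_{w\in\mathcal{H}_M}\|w - w_0\|_P \le c_0 M, \qquad
\sup_{\substack{w\in\mathcal{H}_M\\ \|w-w_0\|_P\le\delta}}\|w - w_0\|_\infty \le c_0 M, \tag{37}
\]
\[
\sup_{g\in\mathcal{G}_M}\|g - w_0^2\|_P \le c_0 M, \qquad
\sup_{\substack{g\in\mathcal{G}_M\\ \|g-w_0^2\|_P\le\delta}}\|g - w_0^2\|_\infty \le c_0 M. \tag{38}
\]

The inequalities in (38) are derived as follows. For $g \in \mathcal{G}_M$, there exists $v \in \mathcal{H}$ such that $v^2 = g$
and $\|v\|_{\mathcal{H}}^2 < M$, and then, we have

\[
\|g - w_0^2\|_P \le \|g - w_0^2\|_\infty \le \|v\|_\infty^2 + \|w_0\|_\infty^2
\le \|v\|_{\mathcal{H}}^2 + \|w_0\|_\infty^2 \le M + \|w_0\|_\infty^2 \le c_0 M \quad (M \ge 1).
\]

In the same way, (37) also holds. Therefore, due to Proposition 2 and (37), we have

\[
\sup_{w\in\mathcal{H}}\frac{\bigl|\int(w_0 - w)\,d(Q - Q_m)\bigr|}{D(w)} = O_p(1),
\]

where $D(w)$ is defined as

\[
D(w) = \frac{\|w_0 - w\|_P^{1-\gamma/2}\,\|w\|_{\mathcal{H}}^{\gamma/2}}{\sqrt{m}} \vee \frac{\|w\|_{\mathcal{H}}}{m^{2/(2+\gamma)}}.
\]

In the same way, we have

\[
\sup_{w\in\mathcal{H}}\frac{\bigl|\int(w^2 - w_0^2)\,d(P - P_n)\bigr|}{E(w)} = O_p(1),
\]

where $E(w)$ is defined as

\[
E(w) = \frac{\|w^2 - w_0^2\|_P^{1-\gamma/2}\,J(w^2)^{\gamma/2}}{\sqrt{n}} \vee \frac{J(w^2)}{n^{2/(2+\gamma)}}.
\]

Note that $\|w_0^2 - w^2\|_P \le (\|w_0\|_\infty + \|w\|_\infty)\|w - w_0\|_P = O((1 + \|w\|_{\mathcal{H}})\|w - w_0\|_P)$ and
$J(w^2) \le \|w\|_{\mathcal{H}}^2$. Then, up to a constant factor, we obtain

\[
E(w) \le \frac{\|w - w_0\|_P^{1-\gamma/2}\,(1 + \|w\|_{\mathcal{H}})^{1+\gamma/2}}{\sqrt{n}} \vee \frac{\|w\|_{\mathcal{H}}^2}{n^{2/(2+\gamma)}}.
\]

□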



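Now we show the proof of Theorem 1.

Proof. The estimator $\hat w$ satisfies the inequality

\[
\frac{1}{2}\int\hat w^2\,dP_n - \int\hat w\,dQ_m + \frac{\lambda}{2}\|\hat w\|_{\mathcal{H}}^2
\le \frac{1}{2}\int w_0^2\,dP_n - \int w_0\,dQ_m + \frac{\lambda}{2}\|w_0\|_{\mathcal{H}}^2.
\]

Then, we have

\[
\frac{1}{2}\|\hat w - w_0\|_P^2 = \int(w_0 - \hat w)\,dQ + \frac{1}{2}\int(\hat w^2 - w_0^2)\,dP
\]
\[
\le \int(w_0 - \hat w)\,d(Q - Q_m) + \frac{1}{2}\int(\hat w^2 - w_0^2)\,d(P - P_n) + \frac{\lambda}{2}\|w_0\|_{\mathcal{H}}^2 - \frac{\lambda}{2}\|\hat w\|_{\mathcal{H}}^2.
\]

As a result, we have

\[
\frac{1}{2}\|\hat w - w_0\|_P^2 + \frac{\lambda}{2}\|\hat w\|_{\mathcal{H}}^2
\le \Bigl|\int(\hat w - w_0)\,d(Q - Q_m)\Bigr| + \frac{1}{2}\Bigl|\int(\hat w^2 - w_0^2)\,d(P - P_n)\Bigr| + \frac{\lambda}{2}\|w_0\|_{\mathcal{H}}^2
\]
\[
\le O_p\Bigl(\lambda + \frac{\|w_0 - \hat w\|_P^{1-\gamma/2}(1 + \|\hat w\|_{\mathcal{H}})^{1+\gamma/2}}{\sqrt{n\wedge m}} + \frac{(1 + \|\hat w\|_{\mathcal{H}})^2}{(n\wedge m)^{2/(2+\gamma)}}\Bigr),
\]

where Lemma 2 is used.

We need to study three possibilities:

\[
\frac{1}{2}\|w_0 - \hat w\|_P^2 + \frac{\lambda}{2}\|\hat w\|_{\mathcal{H}}^2 \le O_p(\lambda), \tag{39}
\]
\[
\frac{1}{2}\|w_0 - \hat w\|_P^2 + \frac{\lambda}{2}\|\hat w\|_{\mathcal{H}}^2 \le O_p\Bigl(\frac{\|w_0 - \hat w\|_P^{1-\gamma/2}(1 + \|\hat w\|_{\mathcal{H}})^{1+\gamma/2}}{\sqrt{n\wedge m}}\Bigr), \tag{40}
\]
\[
\frac{1}{2}\|w_0 - \hat w\|_P^2 + \frac{\lambda}{2}\|\hat w\|_{\mathcal{H}}^2 \le O_p\Bigl(\frac{(1 + \|\hat w\|_{\mathcal{H}})^2}{(n\wedge m)^{2/(2+\gamma)}}\Bigr). \tag{41}
\]

One of the above inequalities should be satisfied. We study each inequality below.

Case (39): we have

\[
\frac{1}{2}\|w_0 - \hat w\|_P^2 \le O_p(\lambda), \qquad \frac{\lambda}{2}\|\hat w\|_{\mathcal{H}}^2 \le O_p(\lambda),
\]

and hence the inequalities $\|w_0 - \hat w\|_P \le O_p(\lambda^{1/2})$ and $\|\hat w\|_{\mathcal{H}} \le O_p(1)$ hold.

Case (40): we have

\[
\|w_0 - \hat w\|_P^2 \le O_p\Bigl(\frac{\|w_0 - \hat w\|_P^{1-\gamma/2}(1 + \|\hat w\|_{\mathcal{H}})^{1+\gamma/2}}{\sqrt{n\wedge m}}\Bigr), \qquad
\lambda\|\hat w\|_{\mathcal{H}}^2 \le O_p\Bigl(\frac{\|w_0 - \hat w\|_P^{1-\gamma/2}(1 + \|\hat w\|_{\mathcal{H}})^{1+\gamma/2}}{\sqrt{n\wedge m}}\Bigr).
\]

The first inequality provides

\[
\|w_0 - \hat w\|_P \le O_p\Bigl(\frac{1 + \|\hat w\|_{\mathcal{H}}}{(n\wedge m)^{1/(2+\gamma)}}\Bigr).
\]

Thus, the second inequality leads to

\[
\lambda\|\hat w\|_{\mathcal{H}}^2 \le O_p\Bigl(\frac{\|w_0 - \hat w\|_P^{1-\gamma/2}(1 + \|\hat w\|_{\mathcal{H}})^{1+\gamma/2}}{(n\wedge m)^{1/2}}\Bigr)
\le O_p\Bigl(\frac{(1 + \|\hat w\|_{\mathcal{H}})^2}{(n\wedge m)^{2/(2+\gamma)}}\Bigr).
\]

Hence, we have $\|\hat w\|_{\mathcal{H}} = O_p(1)$ for infinitesimally small $\gamma > 0$. Then, we obtain

\[
\|w_0 - \hat w\|_P \le O_p\Bigl(\frac{1}{(n\wedge m)^{1/(2+\gamma)}}\Bigr).
\]

Case (41): we have

\[
\frac{1}{2}\|w_0 - \hat w\|_P^2 \le O_p\Bigl(\frac{(1 + \|\hat w\|_{\mathcal{H}})^2}{(n\wedge m)^{2/(2+\gamma)}}\Bigr), \qquad
\frac{\lambda}{2}\|\hat w\|_{\mathcal{H}}^2 \le O_p\Bigl(\frac{(1 + \|\hat w\|_{\mathcal{H}})^2}{(n\wedge m)^{2/(2+\gamma)}}\Bigr).
\]

Then, as shown in the case (40), we have $\|\hat w\|_{\mathcal{H}} = O_p(1)$. Hence, we obtain

\[
\|w_0 - \hat w\|_P \le O_p\Bigl(\frac{1}{(n\wedge m)^{1/(2+\gamma)}}\Bigr).
\]

□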

B Leave-one-out Cross-validation of KuLSIF 

The procedure to compute the leave-one-out cross-validation score of KuLSIF is presented here.
Let X}? G s)fi('^-i)x('^-i) and i^iJ = K^^^^'^ G s)fj(n-i)x(m-i) ^^le Gram matrices of samples 
except X£ and y^, respectively. According to Theorem 3, the estimated parameters a^^^ and 
of 

^W(., - 

is equal to 



where /n-i denotes the (n — 1) by (n — 1) identity matrix. Hence, the parameter a^^^ is the 
solution of the following convex quadratic problem, 

min Ja^(Ki? + (n - l)A/„-i)a + —^—l1-iKSa, a G sR"-\ (42) 

^ (777/ -L )/\ 

The same solution can be obtained by solving 

min ^a'^{Kii + (n - l)XIn)a + J^~Y)X^^"^ ~ em,eV K2ia, ^^^^ 
s.t. a e ^IT', at = 0, 

where Sm/ G 3?™ is the standard unit vector with only the ^-th component being 1. The optimal 
solution of (43) denoted by a^^^ is equal to 

a^^) = (Ku + (n - l)A/„)-^ (~ (m - 1)A ^^^^^'" ~ ^""'^^ ~ , 
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where q is determined so that of' = 0. The estimator a^^^ G 3??^* ^ is equal to 
the (n — l)-dimensional vector consisting of a^^^ except the ^-th component, i.e., 5^^^ = 

yu.^ , . . . ,LX^_^,uc^_^_^, . . . ,u.n ) ■ 
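The equivalence between the reduced problem (42) and the constrained problem (43) can be checked numerically. The sketch below is illustrative only: the Gaussian kernel, the random data, and the function names `gaussian_gram`, `loo_alpha_reduced` and `loo_alpha_constrained` are assumptions made for this example and are not part of the original text. Indices are 0-based in the code.

```python
import numpy as np

def gaussian_gram(U, V, sigma=1.0):
    """Gram matrix exp(-||u - v||^2 / (2 sigma^2)); the kernel choice is an assumption."""
    sq = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def loo_alpha_reduced(K11, K12, lam, l):
    """Problem (42): drop the l-th x-sample and the l-th y-sample, then solve."""
    n, m = K12.shape
    kx = np.delete(np.arange(n), l)
    ky = np.delete(np.arange(m), l)
    K11_l = K11[np.ix_(kx, kx)]
    K12_l = K12[np.ix_(kx, ky)]
    rhs = -K12_l @ np.ones(m - 1) / ((m - 1) * lam)
    return np.linalg.solve(K11_l + (n - 1) * lam * np.eye(n - 1), rhs)

def loo_alpha_constrained(K11, K12, lam, l):
    """Problem (43): keep the full Gram matrices and impose alpha_l = 0."""
    n, m = K12.shape
    G = np.linalg.inv(K11 + (n - 1) * lam * np.eye(n))
    mask = np.ones(m)
    mask[l] = 0.0                              # the vector 1_m - e_{m,l}
    u = -G @ (K12 @ mask) / ((m - 1) * lam)    # solution without the constraint
    c = u[l] / G[l, l]                         # multiplier that zeroes the l-th entry
    return u - c * G[:, l]

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 2))                    # n = 7 samples
Y = rng.normal(size=(5, 2))                    # m = 5 samples
K11, K12 = gaussian_gram(X, X), gaussian_gram(X, Y)
lam, l = 0.1, 2
alpha_bar = loo_alpha_constrained(K11, K12, lam, l)
alpha_tilde = loo_alpha_reduced(K11, K12, lam, l)
print(np.allclose(np.delete(alpha_bar, l), alpha_tilde))
```

The practical point of the constrained form is that the inverse of $K_{11} + (n-1)\lambda I_n$ is computed once and reused for every left-out index, instead of factorizing a different $(n-1)\times(n-1)$ matrix each time.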

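The parameters of the leave-one-out estimator,

\[ A = (\bar{\alpha}^{(1)}, \ldots, \bar{\alpha}^{(n \wedge m)}) \in \mathbb{R}^{n \times (n \wedge m)}, \qquad B = (\bar{\beta}^{(1)}, \ldots, \bar{\beta}^{(n \wedge m)}) \in \mathbb{R}^{m \times (n \wedge m)}, \]

also have analytic expressions. Let $G \in \mathbb{R}^{n \times n}$ be $G = (K_{11} + (n-1)\lambda I_n)^{-1}$, and $E \in \mathbb{R}^{m \times (n \wedge m)}$ be the matrix defined as

\[ E_{ij} = \begin{cases} 1 & i \neq j, \\ 0 & i = j. \end{cases} \]

Let $S \in \mathbb{R}^{n \times (n \wedge m)}$ be

and $T \in \mathbb{R}^{n \times (n \wedge m)}$ be

Then, we obtain

\[ A = G(S - T), \qquad B = \frac{1}{(m-1)\lambda} E. \]

Let $K_X \in \mathbb{R}^{(n \wedge m) \times (n+m)}$ be the sub-matrix of $(K_{11}\ K_{12})$ formed by the first $n \wedge m$ rows and all columns. Similarly, let $K_Y \in \mathbb{R}^{(n \wedge m) \times (n+m)}$ be the sub-matrix of $(K_{21}\ K_{22})$ formed by the first $n \wedge m$ rows and all columns. Let the product $U * U'$ be the element-wise multiplication of matrices $U$ and $U'$ of the same size, i.e., the $(i,j)$ element is given by $U_{ij}U'_{ij}$. Then, we have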



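\[ \hat{w}_X = \bigl(\hat{w}^{(1)}(X_1), \ldots, \hat{w}^{(n \wedge m)}(X_{n \wedge m})\bigr)^\top = \bigl(K_X * (A^\top\ B^\top)\bigr) 1_{n+m}, \]

\[ \hat{w}_Y = \bigl(\hat{w}^{(1)}(Y_1), \ldots, \hat{w}^{(n \wedge m)}(Y_{n \wedge m})\bigr)^\top = \bigl(K_Y * (A^\top\ B^\top)\bigr) 1_{n+m}, \]

\[ \hat{w}_{X+} = \max\{\hat{w}_X, 0\}, \qquad \hat{w}_{Y+} = \max\{\hat{w}_Y, 0\}, \]

where the max operation for a vector is applied in the element-wise manner. As a result, LOOCV (18) is equal to

\[ \mathrm{LOOCV} = \frac{1}{n \wedge m}\left\{ \frac{1}{2}\hat{w}_{X+}^\top \hat{w}_{X+} - 1_{n \wedge m}^\top \hat{w}_{Y+} \right\}. \]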
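The element-wise product in the expressions for the leave-one-out evaluations is a compact way of writing "evaluate the $\ell$-th leave-one-out model at the $\ell$-th sample". The sketch below checks that identity against an explicit loop. It is an illustration under stated assumptions: `A`, `B` and `K_X` are random placeholders with the right shapes rather than fitted quantities, and the function name `loo_evaluations` is hypothetical.

```python
import numpy as np

def loo_evaluations(K_rows, A, B):
    """Entry l equals sum_i A[i, l] * K_rows[l, i] + sum_j B[j, l] * K_rows[l, n + j]."""
    coef = np.hstack([A.T, B.T])           # the block matrix (A^T  B^T), shape (r, n + m)
    return (K_rows * coef).sum(axis=1)     # element-wise product, then multiply by 1_{n+m}

rng = np.random.default_rng(1)
n, m = 6, 4
r = min(n, m)                              # n ^ m in the text
A = rng.normal(size=(n, r))                # placeholder coefficients, not fitted values
B = rng.normal(size=(m, r))
K_X = rng.uniform(size=(r, n + m))         # stands in for the first r rows of (K11 K12)

w_X = loo_evaluations(K_X, A, B)
# Reference computation: one model per left-out index, evaluated at its own sample.
w_X_loop = np.array([K_X[l, :n] @ A[:, l] + K_X[l, n:] @ B[:, l] for l in range(r)])
w_X_plus = np.maximum(w_X, 0.0)            # element-wise max with zero
print(np.allclose(w_X, w_X_loop))
```

Writing the evaluations this way avoids forming the full product of the Gram rows with every coefficient column, of which only the diagonal entries are needed.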



C Proof of Eq. (25) 

Let $\kappa(A)$ be the condition number of the symmetric positive definite matrix $A$. We prove that the equality

\[ \min_{S:\,\kappa(S) \le C} \kappa(SAS) = \max\left\{ \frac{\kappa(A)}{C^2},\ 1 \right\} \]

holds. The same equality holds for the condition number defined through singular values for non-symmetric matrices. We prove the case that $S$ is a symmetric matrix for simplicity. Note that $\kappa(S^2) = \kappa(S)^2$ and $\kappa(S) = \kappa(S^{-1})$; thus we obtain Eq. (25).



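Proof. First, we prove $\min_{S:\,\kappa(S) \le C} \kappa(SAS) \ge \max\{\kappa(A)/C^2,\ 1\}$.

The matrix $A$ is symmetric positive definite; thus, there exist an orthogonal matrix $Q$ and a diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ such that $A = Q\Lambda Q^\top$. The eigenvalues are arranged in decreasing order, i.e., $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n > 0$. In a similar way, let $S$ be $PDP^\top$, where $P$ is an orthogonal matrix and $D = \mathrm{diag}(d_1, \ldots, d_n)$ is a diagonal matrix such that $d_1 \ge d_2 \ge \cdots \ge d_n > 0$ and $d_1/d_n \le C$. Hence,

\[ \kappa(SAS) = \kappa(PDP^\top Q\Lambda Q^\top PDP^\top) = \kappa(DP^\top Q\Lambda Q^\top PD). \]

Let $R$ be $P^\top Q$, which is also an orthogonal matrix. The maximum eigenvalue of $DR\Lambda R^\top D$ is given as

\[ \max_{\|x\|=1} x^\top DR\Lambda R^\top Dx. \]

Let $R = (r_1, \ldots, r_n)$, where $r_i \in \mathbb{R}^n$, and we choose $x_1$ such that $r_i^\top Dx_1 = 0$ for $i = 2, \ldots, n$ and $\|x_1\| = 1$. Then,

\[ \max_{\|x\|=1} x^\top DR\Lambda R^\top Dx \ge x_1^\top DR\Lambda R^\top Dx_1 = \lambda_1 (x_1^\top Dr_1)^2. \]

From the assumption on $x_1$, $Dx_1$ is represented as $c\,r_1$ for some $c$, and we have $(x_1^\top Dr_1)^2 = c^2 = x_1^\top D^2 x_1 \ge d_n^2$. Hence, we have

\[ \max_{\|x\|=1} x^\top SASx \ge \lambda_1 d_n^2. \]

On the other hand, the minimum eigenvalue of $DR\Lambda R^\top D$ is given as

\[ \min_{\|x\|=1} x^\top DR\Lambda R^\top Dx. \]

We choose $x_n$ such that $r_i^\top Dx_n = 0$ for $i = 1, \ldots, n-1$ and $\|x_n\| = 1$. Then,

\[ \min_{\|x\|=1} x^\top DR\Lambda R^\top Dx \le x_n^\top DR\Lambda R^\top Dx_n = \lambda_n (x_n^\top Dr_n)^2 \le \lambda_n\, x_n^\top D^2 x_n \ \ \text{(Schwarz inequality)} \ \le \lambda_n d_1^2. \]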

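As a result, the condition number of $SAS$ is bounded below as

\[ \kappa(SAS) \ge \frac{\lambda_1 d_n^2}{\lambda_n d_1^2} = \frac{\kappa(A)}{(d_1/d_n)^2} \ge \frac{\kappa(A)}{C^2}. \]

Since $\kappa(SAS) \ge 1$ always holds, the lower bound $\max\{\kappa(A)/C^2,\ 1\}$ follows.

Next, we prove $\min_{S:\,\kappa(S) \le C} \kappa(SAS) \le \max\{\kappa(A)/C^2,\ 1\}$. If $\kappa(A) \le C^2$, the equality $\min_{S:\,\kappa(S) \le C} \kappa(SAS) = 1$ holds, because we can choose $S = A^{-1/2}$. Then, we prove $\min_{S:\,\kappa(S) \le C} \kappa(SAS) \le \kappa(A)/C^2$ if $1 \le C^2 \le \kappa(A)$ is satisfied.

Let $S = Q\Gamma Q^\top$, where $\Gamma$ is the diagonal matrix $\mathrm{diag}(\gamma_1, \ldots, \gamma_n)$; then $\kappa(SAS) = \kappa(\mathrm{diag}(\gamma_1^2\lambda_1, \ldots, \gamma_n^2\lambda_n))$ holds. Let $\gamma_1 = 1$ and $\gamma_n = C$. Since $1 \le C^2 \le \kappa(A) = \lambda_1/\lambda_n$ holds, for $k = 2, \ldots, n-1$ we have

\[ 1 \le \min\left\{ C,\ \sqrt{\lambda_1/\lambda_k} \right\}, \qquad C\sqrt{\lambda_n/\lambda_k} \le \min\left\{ C,\ \sqrt{\lambda_1/\lambda_k} \right\}, \]

and thus, we obtain

\[ \max\left\{ 1,\ C\sqrt{\lambda_n/\lambda_k} \right\} \le \min\left\{ C,\ \sqrt{\lambda_1/\lambda_k} \right\}, \qquad k = 2, \ldots, n-1. \]

Hence, there exist $\gamma_k$, $k = 2, \ldots, n-1$, such that

\[ \max\left\{ 1,\ C\sqrt{\lambda_n/\lambda_k} \right\} \le \gamma_k \le \min\left\{ C,\ \sqrt{\lambda_1/\lambda_k} \right\}. \]

Thus, $1 \le \gamma_k \le C$ holds for all $k = 2, \ldots, n-1$. Moreover, $C^2\lambda_n \le \gamma_k^2\lambda_k \le \lambda_1$ also holds. These inequalities imply $\kappa(S) = C$ and $\kappa(SAS) = \lambda_1/(C^2\lambda_n) = \kappa(A)/C^2$. Therefore $\min_{S:\,\kappa(S) \le C} \kappa(SAS) \le \kappa(A)/C^2$ holds if $1 \le C^2 \le \kappa(A)$. □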
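The construction in the second half of the proof is explicit enough to run. The sketch below builds a scaling matrix $S = Q\Gamma Q^\top$ following that recipe and checks the resulting condition numbers. It is a minimal illustration: the function name `scaling_matrix`, the choice of the midpoint of the admissible interval for each $\gamma_k$, and the test matrix are assumptions of this example (any $\gamma_k$ inside the interval would do).

```python
import numpy as np

def scaling_matrix(A, C):
    """Return a symmetric S with kappa(S) <= C built from the eigenvectors of A."""
    lam, Q = np.linalg.eigh(A)             # eigh returns ascending eigenvalues
    lam, Q = lam[::-1], Q[:, ::-1]         # reorder so that lam[0] >= ... >= lam[-1]
    if lam[0] / lam[-1] <= C ** 2:
        gamma = lam ** -0.5                # S = A^{-1/2}, which gives kappa(SAS) = 1
    else:
        lower = np.maximum(1.0, C * np.sqrt(lam[-1] / lam))
        upper = np.minimum(C, np.sqrt(lam[0] / lam))
        gamma = 0.5 * (lower + upper)      # midpoint; endpoints give gamma_1 = 1, gamma_n = C
    return Q @ np.diag(gamma) @ Q.T

rng = np.random.default_rng(2)
Q0, _ = np.linalg.qr(rng.normal(size=(4, 4)))
A = Q0 @ np.diag([100.0, 10.0, 4.0, 1.0]) @ Q0.T   # kappa(A) = 100
C = 5.0
S = scaling_matrix(A, C)
kappa_S = np.linalg.cond(S)                # should be close to C
kappa_SAS = np.linalg.cond(S @ A @ S)      # should be close to kappa(A) / C^2
print(kappa_S, kappa_SAS)
```

With $C$ large enough that $C^2 \ge \kappa(A)$, the same function falls back to $S = A^{-1/2}$, which corresponds to the case where the minimum equals 1.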



D Proof of Theorem 4 

We show the proof of Theorem 4. 

Proof. Let $w_1$ be the constant function taking 1 over $\mathcal{Z}$. In a universal RKHS, for any $\delta > 0$, there exists $w \in \mathcal{H}$ such that $\|w_1 - w\|_\infty \le \delta$. According to Appendix D in Horn and Johnson (1985), the eigenvalues of a matrix are continuous in its entries, and thus so are the minimal and maximal eigenvalues and the condition number, as long as the condition number is well-defined. Then, for any $\varepsilon > 0$ and for any $\psi$ satisfying $\psi''(1) = 1$, there exists $w \in \mathcal{H}$ such that

\[ \kappa_0(D_{\psi,w}) \ge \kappa_0(I_n) - \varepsilon. \]

Then, for fixed samples $X_1, \ldots, X_n$, we find that

\[ \sup\{\kappa_0(D_{\psi,w}) \mid w \in \mathcal{H}\} \ge \kappa_0(I_n). \]

On the other hand, for $\psi(z) = z^2/2$ we obtain

\[ \sup\{\kappa_0(D_{\psi,w}) \mid w \in \mathcal{H}\} = \kappa_0(I_n). \]

Thus, (30) holds. □

E Proof of Theorem 5



The following lemma is the key to prove Theorem 5. 

Lemma 3. Suppose that the kernel function $k$ satisfies the condition in Theorem 5, and that the expectation of $\psi''(w(X_1))$ exists. The probability $\Pr(\cdots)$ is defined from the distribution of samples $X_1, \ldots, X_n, Y_1, \ldots, Y_m$. Then, there exists a positive constant $\varepsilon > 0$ such that the probability distribution of $\kappa(H)$ is bounded above by

\[ \Pr(\kappa(H) \le \delta) \le F_n\!\left(\frac{c}{\varepsilon}\right) + \frac{\delta}{c}\bigl(E[\psi''(w(X_1))] + \lambda\bigr), \tag{44} \]

where $c$ is an arbitrary positive value. On the other hand, for any positive number $c > 0$, we have

\[ \Pr\!\left(\kappa(H) > \kappa(K_{11})\Bigl(1 + \frac{c}{\lambda}\Bigr)\right) \le 1 - F_n(c) \tag{45} \]

if the Gram matrix $K_{11}$ is almost surely positive definite.

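Proof. Let $k_i$ be the $i$-th column vector of the Gram matrix $K_{11}$. Due to the condition on the kernel function, there exists a constant $\varepsilon > 0$ such that

\[ \Pr\bigl(\sqrt{\varepsilon} \le (K_{11})_{ij} \le 1,\ i, j = 1, \ldots, n\bigr) = 1, \]

where the probability is induced from the joint probability of $X_1, \ldots, X_n$. Hence,

\[ \Pr\bigl(\varepsilon n \le \|k_i\|^2 \le n,\ i = 1, \ldots, n\bigr) = 1 \tag{46} \]

also holds.

Let $d_i$ be $\psi''(w(X_i))$; then the matrix $H$ is represented as

\[ H = \frac{1}{n}\sum_{i=1}^n d_i k_i k_i^\top + \lambda K_{11} \in \mathbb{R}^{n \times n}. \]

Let us define

\[ Y_n = \min_{\|a\|=1} a^\top Ha, \qquad Z_n = \max_{\|a\|=1} a^\top Ha. \]

$Y_n$ and $Z_n$ are the minimal and maximal eigenvalues of $H$. Thus, the condition number of $H$ is given as $\kappa(H) = Z_n/Y_n$.

We derive an upper bound of $Y_n$ and a lower bound of $Z_n$ to prove the first inequality (44). The minimal eigenvalue is less than or equal to the average of all eigenvalues, and the sum of eigenvalues is equal to the trace of the matrix. Thus, we have

\[ Y_n \le \frac{1}{n}\mathrm{Tr}\!\left( \frac{1}{n}\sum_{i=1}^n d_i k_i k_i^\top + \lambda K_{11} \right) \le \frac{1}{n}\sum_{i=1}^n d_i + \lambda, \]

where (46) was used. On the other hand, for any $j = 1, \ldots, n$, the inequality

\[ Z_n = \max_{\|a\|=1} \frac{1}{n}\sum_{i=1}^n d_i (k_i^\top a)^2 + \lambda a^\top K_{11} a \ \ge\ \max_{\|a\|=1} \frac{1}{n}\sum_{i=1}^n d_i (k_i^\top a)^2 \ \ge\ \frac{1}{n}\sum_{i=1}^n d_i \bigl(k_i^\top k_j / \|k_j\|\bigr)^2 \ \ge\ \frac{1}{n} d_j \|k_j\|^2 \ \ge\ \varepsilon d_j \]

holds, where $k_j/\|k_j\|$ is substituted into $a$ in the third step. The last inequality follows from (46). Hence, we have

\[ Z_n \ge \varepsilon \max_j d_j. \]

Therefore, for any $\delta > 0$, we have

\[ \Pr(\kappa(H) \le \delta) \le \Pr\!\left( \frac{\varepsilon \max_j d_j}{\frac{1}{n}\sum_{i=1}^n d_i + \lambda} \le \delta \right). \tag{47} \]

The probability of the numerator in (47) is given as

\[ \Pr\bigl(\varepsilon \max_i d_i \le c_1\bigr) = F_n\!\left(\frac{c_1}{\varepsilon}\right), \qquad c_1 > 0. \]

For the probability of the denominator in (47), we use Markov's inequality:

\[ \Pr\!\left( \Bigl(\frac{1}{n}\sum_{i=1}^n d_i + \lambda\Bigr)^{-1} \le c_2 \right) = \Pr\!\left( \frac{1}{n}\sum_{i=1}^n d_i + \lambda \ge \frac{1}{c_2} \right) \le c_2\bigl(E[d_i] + \lambda\bigr), \qquad c_2 > 0. \]

Combining these two bounds$^1$, we find

\[ \Pr\!\left( \frac{\varepsilon \max_j d_j}{\frac{1}{n}\sum_{i=1}^n d_i + \lambda} \le c_1 c_2 \right) \le F_n\!\left(\frac{c_1}{\varepsilon}\right) + c_2\bigl(E[d_i] + \lambda\bigr). \]

Therefore, setting $c_1 = c$ and $c_2 = \delta/c$, for any $\delta > 0$ and $c > 0$, we have (44).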

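We prove the second inequality (45). Let $\tau_1$ and $\tau_n$ be the maximal and minimal eigenvalues of $K_{11}$. Since all diagonal elements of $K_{11}$ are less than or equal to one, we have $0 < \tau_1 \le \mathrm{Tr}\,K_{11} \le n$. Then, we have a lower bound of $Y_n$ and an upper bound of $Z_n$ as follows:

\[ Y_n = \min_{\|a\|=1} \frac{1}{n}\sum_{i=1}^n d_i (k_i^\top a)^2 + \lambda a^\top K_{11} a \ \ge\ \lambda\tau_n, \]

\[ Z_n = \max_{\|a\|=1} \frac{1}{n}\sum_{i=1}^n d_i (k_i^\top a)^2 + \lambda a^\top K_{11} a \ \le\ \frac{\max_j d_j}{n} \max_{\|a\|=1} \sum_{i=1}^n (k_i^\top a)^2 + \lambda\tau_1 \ =\ \frac{\max_j d_j}{n}\tau_1^2 + \lambda\tau_1 \ \le\ \tau_1 \max_j d_j + \lambda\tau_1, \]

where the last inequality for $Z_n$ follows from $\tau_1 \le n$. Therefore, for any $c > 0$, we have

\[ \Pr\!\left(\kappa(H) > \kappa(K_{11})\Bigl(1 + \frac{c}{\lambda}\Bigr)\right) \le \Pr\!\left( \frac{\tau_1(\max_j d_j + \lambda)}{\lambda\tau_n} > \kappa(K_{11})\Bigl(1 + \frac{c}{\lambda}\Bigr) \right) = \Pr\bigl(\max_j d_j > c\bigr) = 1 - \Pr\bigl(\max_j d_j \le c\bigr) = 1 - F_n(c). \]

□

$^1$ Let $A$, $B$, $a$, and $b$ be four positive numbers. If $A > a$ and $B > b$, then we have $AB > ab$. As the contraposition, if $AB \le ab$, then $A \le a$ or $B \le b$ holds.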
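The four eigenvalue bounds used in the proof of Lemma 3 are deterministic once the samples are fixed, so they can be checked on a small instance. The following sketch is illustrative only: the Gaussian kernel with bandwidth 0.5, the uniform samples, the random positive weights `d` standing in for $\psi''(w(X_i))$, and the function name `lemma3_quantities` are all assumptions of this example. The constant `eps` is taken as the squared minimum Gram entry, which satisfies the requirement $\sqrt{\varepsilon} \le (K_{11})_{ij}$ for the given data.

```python
import numpy as np

def lemma3_quantities(K11, d, lam):
    """Extreme eigenvalues of H = (1/n) sum_i d_i k_i k_i^T + lam * K11 and their bounds."""
    n = K11.shape[0]
    H = (K11 * d) @ K11.T / n + lam * K11  # K11 * d scales column i of K11 by d_i
    eig_H = np.linalg.eigvalsh(H)          # ascending eigenvalues of symmetric H
    eig_K = np.linalg.eigvalsh(K11)
    eps = K11.min() ** 2                   # sqrt(eps) <= every Gram entry
    return {
        "Y": eig_H[0],                     # minimal eigenvalue Y_n
        "Z": eig_H[-1],                    # maximal eigenvalue Z_n
        "Y_upper": d.mean() + lam,         # trace argument
        "Z_lower": eps * d.max(),          # substitute a = k_j / ||k_j||
        "Y_lower": lam * eig_K[0],         # lam * tau_n
        "Z_upper": eig_K[-1] * (d.max() + lam),  # tau_1 * (max_j d_j + lam)
    }

rng = np.random.default_rng(3)
X = rng.uniform(size=(6, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K11 = np.exp(-sq / (2 * 0.5 ** 2))         # entries in (0, 1], unit diagonal
d = rng.uniform(0.5, 2.0, size=6)          # stand-in for psi''(w(X_i)) > 0
q = lemma3_quantities(K11, d, lam=0.1)
print(q["Y_lower"] <= q["Y"] <= q["Y_upper"], q["Z_lower"] <= q["Z"] <= q["Z_upper"])
```

Dividing the bounds on $Z_n$ by those on $Y_n$ gives the two-sided control of $\kappa(H)$ that the lemma converts into probability statements.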



In Lemma 3, the distributions of $Y_n$ and $Z_n$ are computed separately. This idea is borrowed from smoothed analysis of condition numbers (Sankar et al., 2006). In smoothed analysis, the probability $\Pr(\kappa(H) > \delta)$ is bounded above to ensure that the condition number is unlikely to be large. In the above lemma, we used the same technique also for upper-bounding the probability of the form $\Pr(\kappa(H) \le \delta)$. As a result, we obtained the lowest possible order of the condition number $\kappa(H)$.

Below, we show the proof of Theorem 5.

Proof of Theorem 5. The inequality (44) in Lemma 3 provides

\[ \Pr(\kappa(H) \le \delta_n) \le F_n\!\left(\frac{c_n}{\varepsilon}\right) + \frac{\delta_n}{c_n}(M + \lambda). \]

Let $c_n$ be $\varepsilon s_n$ and $\delta_n$ be $o(s_n)$; then, we obtain

\[ \lim_{n \to \infty} \Pr(\kappa(H) \le \delta_n) = 0. \tag{48} \]

We prove another inequality. Due to the second inequality in Lemma 3, we have

\[ \lim_{n \to \infty} \Pr\!\left(\kappa(H) > \kappa(K_{11})\Bigl(1 + \frac{t_n}{\lambda}\Bigr)\right) \le 1 - \lim_{n \to \infty} F_n(t_n) = 0. \tag{49} \]

We complete the proof by combining (48) and (49). □

F Proof of Theorem 6



We show the proof of Theorem 6.

Proof. Assume that $\psi''(z)$ is not a constant function. Since $K_{11}$ is non-singular, the vector $K_{11}\alpha + \frac{1}{m\lambda}K_{12}1_m$ takes an arbitrary value in $\mathbb{R}^n$ by varying $\alpha \in \mathbb{R}^n$. Hence, each diagonal element of $D_{\psi,\alpha}$ can take arbitrary values in an open subset $S \subset \mathbb{R}$. We consider $R^{-1}M_{\psi,R}(\alpha)(R^\top)^{-1}$ instead of $M_{\psi,R}(\alpha)$; one of them is symmetric if and only if the other is. Suppose that there exists a matrix $R$ such that the matrix

\[ R^{-1}M_{\psi,R}(\alpha)(R^\top)^{-1} = \frac{1}{n}\mathrm{diag}(s_1, \ldots, s_n)\,K_{11}(R^\top)^{-1} + \lambda(R^\top)^{-1} \]

is symmetric for any $(s_1, \ldots, s_n) \in S^n$. Let $a_{ij}$ be the $(i,j)$ element of $K_{11}(R^\top)^{-1}$, and $t_{ij}$ be the $(i,j)$ element of $(R^\top)^{-1}$. Then, the $(i,j)$ and $(j,i)$ elements of $R^{-1}M_{\psi,R}(\alpha)(R^\top)^{-1}$ are equal to $\frac{1}{n}s_i a_{ij} + \lambda t_{ij}$ and $\frac{1}{n}s_j a_{ji} + \lambda t_{ji}$, respectively. Due to the assumption, the equality

\[ \frac{1}{n}s_i a_{ij} + \lambda t_{ij} = \frac{1}{n}s_j a_{ji} + \lambda t_{ji} \]

holds for any $s_i, s_j \in S$. When $i \neq j$, we obtain $a_{ij} = a_{ji} = 0$ and $t_{ij} = t_{ji}$. Thus, $K_{11}(R^\top)^{-1}$ should be equal to some diagonal matrix, and $(R^\top)^{-1}$ is a symmetric matrix. Thus, there exists a diagonal matrix $Q = \mathrm{diag}(q_1, \ldots, q_n)$ such that $K_{11} = QR$ holds. As a result, we have $(K_{11})_{ij} = q_i R_{ij}$, $(K_{11})_{ji} = q_j R_{ji}$, $R_{ij} = R_{ji}$, and $(K_{11})_{ij} = (K_{11})_{ji}$. Hence we obtain

\[ (K_{11})_{ij} = q_i R_{ij} = q_j R_{ij}, \]

and then, $q_i = q_j$ or $R_{ij} = 0$ holds for any $i$ and $j$. Since $(K_{11})_{ij}$ is a non-zero element, the only possibility is $q_1 = q_2 = \cdots = q_n \neq 0$. Therefore, the diagonal matrix $Q$ should be proportional to the identity matrix, and there exists a constant $c \in \mathbb{R}$ such that the equality $R = cK_{11}$ holds. This equality contradicts the assumption. □